PREFACE



This text is a new and improved edition of Rawlings (1988). It is the out- 
growth of several years of teaching an applied regression course to graduate 
students in the sciences. Most of the students in these classes had taken 
a two-semester introduction to statistical methods that included experi- 
mental design and multiple regression at the level provided in texts such 
as Steel, Torrie, and Dickey (1997) and Snedecor and Cochran (1989). For 
most, the multiple regression had been presented in matrix notation. 

The basic purpose of the course and this text is to develop an understand- 
ing of least squares and related statistical methods without becoming exces- 
sively mathematical. The emphasis is on regression concepts, rather than on 
mathematical proofs. Proofs are given only to develop facility with matrix 
algebra and comprehension of mathematical relationships. Good students, 
even though they may not have strong mathematical backgrounds, quickly 
grasp the essential concepts and appreciate the enhanced understanding. 
The learning process is reinforced with continuous use of numerical exam- 
ples throughout the text and with several case studies. Some numerical 
and mathematical exercises are included to whet the appetite of graduate 
students. 

The first four chapters of the book provide a review of simple regression 
in algebraic notation (Chapter 1), an introduction to key matrix operations 
and the geometry of vectors (Chapter 2), and a review of ordinary least 
squares in matrix notation (Chapters 3 and 4). Chapter 4 also provides 
a foundation for the testing of hypotheses and the properties of sums of 
squares used in analysis of variance. Chapter 5 is a case study giving a 
complete multiple regression analysis using the methods reviewed in the 
first four chapters. Then Chapter 6 gives a brief geometric interpretation
of least squares illustrating the relationships among the data vectors, the 
link between the analysis of variance and the lengths of the vectors, and 
the role of degrees of freedom. Chapter 7 discusses the methods and crite- 
ria for determining which independent variables should be included in the 
models. The next two chapters include special classes of multiple regres- 
sion models. Chapter 8 introduces polynomial and trigonometric regression 
models. This chapter also discusses response curve models that are linear 
in the parameters. Class variables and the analysis of variance of designed 
experiments (models of less than full rank) are introduced in Chapter 9. 

Chapters 10 through 14 address some of the problems that might be 
encountered in regression. A general introduction to the various kinds of 
problems is given in Chapter 10. This is followed by discussions of regression 
diagnostic techniques (Chapter 11), and scaling or transforming variables 
to rectify some of the problems (Chapter 12). Analysis of the correlational 
structure of the data and biased regression are discussed as techniques 
for dealing with the collinearity problem common in observational data 
(Chapter 13). Chapter 14 is a case study illustrating the analysis of data 
in the presence of collinearity. 

Models that are nonlinear in the parameters are presented in Chapter 
15. Chapter 16 is another case study using polynomial response models, 
nonlinear modeling, transformations to linearize, and analysis of residuals. 
Chapter 17 addresses the analysis of unbalanced data. Chapter 18 (new 
to this edition) introduces linear models that have more than one random 
effect. The ordinary least squares approach to such models is given. This is 
followed by the definition of the variance-covariance matrix for such models 
and a brief introduction to mixed effects and random coefficient models. 
The use of iterative maximum likelihood estimation of both the variance 
components and the fixed effects is discussed. The final chapter, Chapter 
19, is a case study of the analysis of unbalanced data. 

We are grateful for the assistance of many in the development of this 
book. Of particular importance have been the dedicated editing of the ear- 
lier edition by Gwen Briggs, daughter of John Rawlings, and her many 
suggestions for improvement. It is uncertain when the book would have 
been finished without her support. A special thanks goes to our former 
student, Virginia Lesser, for her many contributions in reading parts of the 
manuscript, in data analysis, and in the enlistment of many data sets from 
her graduate student friends in the biological sciences. We are indebted to 
our friends, both faculty and students, at North Carolina State University 
for bringing us many interesting consulting problems over the years that 
have stimulated the teaching of this material. We are particularly indebted 
to those (acknowledged in the text) who have generously allowed the use of 
their data. In this regard, Rick Linthurst warrants special mention for his 
stimulating discussions as well as the use of his data. We acknowledge the 
encouragement and valuable discussions of colleagues in the Department 
of Statistics at NCSU, and we thank Matthew Sommerville for checking
answers to the exercises. We wish to thank Sharon Sullivan and Dawn
Haines for their help with LaTeX. Finally, we want to express appreciation
for the critical reviews and many suggestions provided for the first edi- 
tion by the Wadsworth Brooks/Cole reviewers: Mark Conaway, University 
of Iowa; Franklin Graybill, Colorado State University; Jason Hsu, Ohio 
State University; Kenneth Koehler, Iowa State University; B. Lindsay, The 
Pennsylvania State University; Michael Meridith, Cornell University; M. 
B. Rajarshi, University of Poona (India); Muni Srivastava, University of 
Toronto; and Patricia Wahl, University of Washington; and for the second 
edition by the Springer-Verlag reviewers.

Acknowledgment is given for the use of material in the appendix tables. 
Appendix Table A.7 is reproduced in part from Tables 4 and 6 of Durbin
and Watson (1951) with permission of the Biometrika Trustees. Appendix
Table A.8 is reproduced with permission from Shapiro and Francia (1972),
Journal of the American Statistical Association. The remaining appendix 
tables have been computer generated by one of the authors. We gratefully 
acknowledge permission of other authors and publishers for use of material 
from their publications as noted in the text. 



Note to the Reader 

Most research is aimed at quantifying relationships among variables that
either measure the end result of some process or are likely to affect the 
process. The process in question may be any biological, chemical, or phys- 
ical process of interest to the scientist. The quantification of the process 
may be as simple as determining the degree of association between two 
variables or as complicated as estimating the many parameters of a very 
detailed nonlinear mathematical model of the system. 

Regardless of the degree of sophistication of the model, the most com- 
monly used statistical method for estimating the parameters of interest is 
the method of least squares. The criterion applied in least squares es- 
timation is simple and has great intuitive appeal. The researcher chooses 
the model that is believed to be most appropriate for the project at hand. 
The parameters for the model are then estimated such that the predictions 
from the model and the observed data are in as good agreement as possible 
as measured by the least squares criterion, minimization of the sum of 
squared differences between the predicted and the observed points. 

Least squares estimation is a powerful research tool. Few assumptions 
are required and the estimators obtained have several desirable properties. 
Inference from research data to the true behavior of a process, however, 
can be a difficult and dangerous step due to unrecognized inadequacies 
in the data, misspecification of the model, or inappropriate inferences of 
causality. As with any research tool it is important that the least squares
method be thoroughly understood in order to eliminate as much misuse or 
misinterpretation of the results as possible. There is a distinct difference 
between understanding and pure memorization. Memorization can make a 
good technician, but it takes understanding to produce a master. A discus- 
sion of the geometric interpretation of least squares is given to enhance 
your understanding. You may find your first exposure to the geometry of 
least squares somewhat traumatic but the visual perception of least squares 
is worth the effort. We encourage you to tackle the topic in the spirit in 
which it is included. 

The general topic of least squares has been broadened to include statis- 
tical techniques associated with model development and testing. The 
backbone of least squares is the classical multiple regression analysis using 
the linear model to relate several independent variables to a response or 
dependent variable. Initially, this classical model is assumed to be appro- 
priate. Then methods for detecting inadequacies in this model and possible 
remedies are discussed. 

The connection between the analysis of variance for designed experiments 
and multiple regression is developed to build the foundation for the analy- 
sis of unbalanced data. (This also emphasizes the generality of the least 
squares method.) Interpretation of unbalanced data is difficult. It is impor- 
tant that the application of least squares to the analysis of such data be 
understood if the results from computer programs designed for the analysis 
of unbalanced data are to be used correctly. 

The objective of a research project determines the amount of effort to 
be devoted to the development of realistic models. If the intent is one of 
prediction only, the degree to which the model might be considered realistic 
is immaterial. The only requirement is that the predictions be adequately 
precise in the region of interest. On the other hand, realism is of primary 
importance if the goal is a thorough understanding of the system. The 
simple linear additive model can seldom be regarded as a realistic model. 
It is at best an approximation of the true model. Almost without exception, 
models developed from the basic principles of a process will be nonlinear in 
the parameters. The least squares estimation principle is still applicable but 
the mathematical methods become much more difficult. You are introduced 
to nonlinear least squares regression methods and some of the more 
common nonlinear models. 

Least squares estimation is controlled by the correlational structure ob- 
served among the independent and dependent variables in the data set. 
Observational data, data collected by observing the state of nature ac- 
cording to some sampling plan, will frequently cause special problems for 
least squares estimation because of strong correlations or, more generally, 
near-linear dependencies among the independent variables. The serious- 
ness of the problems will depend on the use to be made of the analyses. 
Understanding the correlational structure of the data is most helpful in
interpreting regression results and deciding what inferences might be made.
Principal component analysis is introduced as an aid in characterizing the 
correlational structure of the data. A graphical procedure, Gabriel’s bi- 
plot, is introduced to help visualize the correlational structure. Principal 
component analysis also serves as an introduction to biased regression 
methods. Biased regression methods are designed to alleviate the delete- 
rious effects of near-linear dependencies (among the independent variables) 
on ordinary least squares estimation. 

Least squares estimation is a powerful research tool and, with modern 
low cost computers, is readily available. This ease of access, however, also 
facilitates misuse. Proper use of least squares requires an understanding of 
the basic method and assumptions on which it is built, and an awareness
of the possible problems and their remedies. In some cases, alternative 
methods to least squares estimation might be more appropriate. It is the 
intent of this text to convey the basic understanding that will allow you to 
use least squares as an effective research tool. 

The data sets used in this text are available on the internet at
http://www.stat.ncsu.edu/publications/rawlings/applied_least_squares
or through a link at the Springer-Verlag page. The “readme” file explains
the contents of each data set. 

Raleigh, North Carolina John O. Rawlings 

March 4, 1998 Sastry G. Pantula 

David A. Dickey 
1

REVIEW OF SIMPLE 
REGRESSION 



This chapter reviews the elementary regression results 
for a linear model in one variable. The primary purpose 
is to establish a common notation and to point out the 
need for matrix notation. A light reading should suffice 
for most students. 

Modeling refers to the development of mathematical expressions that 
describe in some sense the behavior of a random variable of interest. This 
variable may be the price of wheat in the world market, the number of 
deaths from lung cancer, the rate of growth of a particular type of tumor, 
or the tensile strength of metal wire. In all cases, this variable is called the 
dependent variable and denoted with Y. A subscript on Y identifies the 
particular unit from which the observation was taken, the time at which 
the price was recorded, the county in which the deaths were recorded, the 
experimental unit on which the tumor growth was recorded, and so forth. 
Most commonly the modeling is aimed at describing how the mean of the 
dependent variable $\mathcal{E}(Y)$ changes with changing conditions; the variance
of the dependent variable is assumed to be unaffected by the changing 
conditions. 

Other variables which are thought to provide information on the behavior 
of the dependent variable are incorporated into the model as predictor or 
explanatory variables. These variables are called the independent vari- 
ables and are denoted by X with subscripts as needed to identify different 
independent variables. Additional subscripts denote the observational unit 
from which the data were taken. The X's are assumed to be known constants.
In addition to the X's, all models involve unknown constants, called
parameters, which control the behavior of the model. These parameters 
are denoted by Greek letters and are to be estimated from the data. 

The mathematical complexity of the model and the degree to which 
it is a realistic model depend on how much is known about the process 
being studied and on the purpose of the modeling exercise. In preliminary 
studies of a process or in cases where prediction is the primary objective, 
the models usually fall into the class of models that are linear in the 
parameters. That is, the parameters enter the model as simple coefficients 
on the independent variables or functions of the independent variables. 
Such models are referred to loosely as linear models. The more realistic 
models, on the other hand, are often nonlinear in the parameters. Most 
growth models, for example, are nonlinear models. Nonlinear models fall 
into two categories: intrinsically linear models, which can be linearized 
by an appropriate transformation on the dependent variable, and those 
that cannot be so transformed. Most of the discussion is devoted to the 
linear class of models and to those nonlinear models that are intrinsically 
linear. Nonlinear models are discussed in Section 12.2 and Chapter 15. 



1.1 The Linear Model and Assumptions 

The simplest linear model involves only one independent variable and states 
that the true mean of the dependent variable changes at a constant rate 
as the value of the independent variable increases or decreases. Thus, the 
functional relationship between the true mean of $Y_i$, denoted by $\mathcal{E}(Y_i)$, and
$X_i$ is the equation of a straight line:

$\mathcal{E}(Y_i) = \beta_0 + \beta_1 X_i$. (1.1)

$\beta_0$ is the intercept, the value of $\mathcal{E}(Y_i)$ when $X = 0$, and $\beta_1$ is the slope of
the line, the rate of change in $\mathcal{E}(Y_i)$ per unit change in $X$.

The observations on the dependent variable $Y_i$ are assumed to be random
observations from populations of random variables with the mean of each
population given by $\mathcal{E}(Y_i)$. The deviation of an observation $Y_i$ from its
population mean $\mathcal{E}(Y_i)$ is taken into account by adding a random error $\epsilon_i$
to give the statistical model

$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$. (1.2)

The subscript $i$ indicates the particular observational unit, $i = 1, 2, \ldots, n$.
The $X_i$ are the $n$ observations on the independent variable and are assumed
to be measured without error. That is, the observed values of $X$ are assumed
to be a set of known constants. The $Y_i$ and $X_i$ are paired observations; both
are measured on every observational unit.



The random errors $\epsilon_i$ have zero mean and are assumed to have common
variance $\sigma^2$ and to be pairwise independent. Since the only random element
in the model is $\epsilon_i$, these assumptions imply that the $Y_i$ also have common
variance $\sigma^2$ and are pairwise independent. For purposes of making tests
of significance, the random errors are assumed to be normally distributed,
which implies that the $Y_i$ are also normally distributed. The random error
assumptions are frequently stated as

$\epsilon_i \sim \text{NID}(0, \sigma^2)$, (1.3)

where NID stands for “normally and independently distributed.” The quan- 
tities in parentheses denote the mean and the variance, respectively, of the 
normal distribution. 



1.2 Least Squares Estimation 

The simple linear model has two parameters $\beta_0$ and $\beta_1$, which are to be
estimated from the data. If there were no random error in $Y_i$, any two data
points could be used to solve explicitly for the values of the parameters.
The random variation in $Y$, however, causes each pair of observed data
points to give different results. (All estimates would be identical only if the 
observed data fell exactly on the straight line.) A method is needed that 
will combine all the information to give one solution which is “best” by 
some criterion. 
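The point that each pair of observations gives a different answer can be checked numerically. The following is a minimal sketch, assuming the four ozone and yield observations that the text lists in Table 1.1; the helper name `two_point_line` is introduced here for illustration only.

```python
from itertools import combinations

# Ozone dose (ppm) and mean soybean yield (g/plant), as listed in Table 1.1.
ozone = [0.02, 0.07, 0.11, 0.15]
yield_g = [242.0, 237.0, 231.0, 201.0]


def two_point_line(x1, y1, x2, y2):
    """Return (intercept, slope) of the straight line through two points."""
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope


# Every pair of observations determines its own line, so the six pairs
# give six different "solutions" for the two parameters.
pairwise = {
    (i, j): two_point_line(ozone[i], yield_g[i], ozone[j], yield_g[j])
    for i, j in combinations(range(len(ozone)), 2)
}

for (i, j), (b0, b1) in pairwise.items():
    print(f"points {i + 1} and {j + 1}: intercept = {b0:8.2f}, slope = {b1:8.2f}")
```

The two-point slopes differ widely from one pair to the next, which is why a single criterion that uses all of the observations is needed.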

The least squares estimation procedure uses the criterion that the 
solution must give the smallest possible sum of squared deviations of the 
observed $Y_i$ from the estimates of their true means provided by the solution.
Let $\hat\beta_0$ and $\hat\beta_1$ be numerical estimates of the parameters $\beta_0$ and $\beta_1$,
respectively, and let

$\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$ (1.4)

be the estimated mean of $Y$ for each $X_i$, $i = 1, \ldots, n$. Note that $\hat{Y}_i$ is
obtained by substituting the estimates for the parameters in the functional
form of the model relating $\mathcal{E}(Y_i)$ to $X_i$, equation 1.1. The least squares
principle chooses $\hat\beta_0$ and $\hat\beta_1$ that minimize the sum of squares of the residuals,
SS(Res):

$\text{SS(Res)} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum e_i^2$, (1.5)

where $e_i = (Y_i - \hat{Y}_i)$ is the observed residual for the $i$th observation. The
summation indicated by $\sum$ is over all observations in the data set as
indicated by the index of summation, $i = 1$ to $n$. (The index of summation is
omitted when the limits of summation are clear from the context.)

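The estimators for $\beta_0$ and $\beta_1$ are obtained by using calculus to find the
values that minimize SS(Res). The derivatives of SS(Res) with respect to
$\hat\beta_0$ and $\hat\beta_1$ in turn are set equal to zero. This gives two equations in two
unknowns called the normal equations:

$n\hat\beta_0 + (\sum X_i)\hat\beta_1 = \sum Y_i$
$(\sum X_i)\hat\beta_0 + (\sum X_i^2)\hat\beta_1 = \sum X_i Y_i$. (1.6)

Solving the normal equations simultaneously for $\hat\beta_0$ and $\hat\beta_1$ gives the
estimates of $\beta_1$ and $\beta_0$ as

$\hat\beta_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum x_i y_i}{\sum x_i^2}$
$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$. (1.7)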
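The calculus step can be written out explicitly. The following is a sketch of the differentiation, using only the notation already introduced, with SS(Res) written as a function of the two estimates:

```latex
\begin{align*}
\text{SS(Res)} &= \sum_{i=1}^{n} \bigl(Y_i - \hat\beta_0 - \hat\beta_1 X_i\bigr)^2, \\
\frac{\partial\,\text{SS(Res)}}{\partial \hat\beta_0}
  &= -2 \sum_{i=1}^{n} \bigl(Y_i - \hat\beta_0 - \hat\beta_1 X_i\bigr) = 0, \\
\frac{\partial\,\text{SS(Res)}}{\partial \hat\beta_1}
  &= -2 \sum_{i=1}^{n} X_i \bigl(Y_i - \hat\beta_0 - \hat\beta_1 X_i\bigr) = 0.
\end{align*}
```

Dividing each equation by $-2$ and collecting terms gives the two normal equations. The first normal equation, divided by $n$, gives $\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$. Substituting this into the second gives $\hat\beta_1 (\sum X_i^2 - \bar{X} \sum X_i) = \sum X_i Y_i - \bar{Y} \sum X_i$, which is $\hat\beta_1 \sum x_i^2 = \sum x_i y_i$ in deviation form.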

Note that $x_i = (X_i - \bar{X})$ and $y_i = (Y_i - \bar{Y})$ denote observations expressed
as deviations from their sample means $\bar{X}$ and $\bar{Y}$, respectively. The more
convenient forms for hand computation of sums of squares and sums of
products are

$\sum x_i^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n}$
$\sum x_i y_i = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}$. (1.8)



Thus, the computational formula for the slope is

$\hat\beta_1 = \frac{\sum X_i Y_i - (\sum X_i)(\sum Y_i)/n}{\sum X_i^2 - (\sum X_i)^2/n}$. (1.9)

These estimates of the parameters give the regression equation

$\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$. (1.10)



Example 1.1. The computations for the linear regression analysis are illustrated using
treatment mean data from a study conducted by Dr. A. S. Heagle at North 
Carolina State University on effects of ozone pollution on soybean yield 
(Table 1.1). Four dose levels of ozone and the resulting mean seed yield of 
soybeans are given. The dose of ozone is the average concentration (parts 
per million, ppm) during the growing season. Yield is reported in grams 
per plant. 
TABLE 1.1. Mean yields of soybean plants (gms per plant) obtained in response
to the indicated levels of ozone exposure over the growing season. (Data courtesy
of Dr. A. S. Heagle, USDA and North Carolina State University.)

          X                          Y
     Ozone (ppm)               Yield (gm/plt)

         .02                        242
         .07                        237
         .11                        231
         .15                        201

  $\sum X_i$   = .35           $\sum Y_i$   = 911
  $\bar{X}$    = .0875         $\bar{Y}$    = 227.75
  $\sum X_i^2$ = .0399         $\sum Y_i^2$ = 208,495

               $\sum X_i Y_i$ = 76.99


Assuming a linear relationship between yield and ozone dose, the simple
linear model, described by equation 1.2, is appropriate. The estimates of
$\beta_0$ and $\beta_1$ obtained from equations 1.7 and 1.9 are

$\hat\beta_1 = \frac{76.99 - (.35)(911)/4}{.0399 - (.35)^2/4} = -293.531$
$\hat\beta_0 = 227.75 - (-293.531)(.0875) = 253.434$. (1.11)



The least squares regression equation characterizing the effects of ozone 
on the mean yield of soybeans in this study, assuming the linear model is 
correct, is 



$\hat{Y}_i = 253.434 - 293.531 X_i$. (1.12)

The interpretation of $\hat\beta_1 = -294$ is that the mean yield is expected to
decrease, since the slope is negative, by 294 grams per plant with each
1 ppm increase in ozone, or 2.94 grams with each .01 ppm increase in
ozone. The observed range of ozone levels in the experiment was .02 ppm
to .15 ppm. Therefore, it would be an unreasonable extrapolation to expect
this rate of decrease in yield to continue if ozone levels were to increase, for
example, to as much as 1 ppm. It is safe to use the results of regression only
within the range of values of the independent variable. The intercept, $\hat\beta_0$ =
253 grams, is the value of $Y$ where the regression line crosses the $Y$-axis.
In this case, since the lowest dose is .02 ppm, it would be an extrapolation
to interpret $\hat\beta_0$ as the estimate of the mean yield when there is no ozone. ■
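The hand computation above can be reproduced in a few lines. This is a minimal sketch that applies the computational forms of equations 1.8 and 1.9 and then equation 1.7 to the Table 1.1 data; the function name `least_squares_line` is an illustrative choice, not something defined in the text.

```python
# Ozone dose X (ppm) and mean soybean yield Y (g/plant) from Table 1.1.
X = [0.02, 0.07, 0.11, 0.15]
Y = [242.0, 237.0, 231.0, 201.0]


def least_squares_line(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # Equation 1.8: corrected sum of squares and corrected sum of products.
    sxx = sum_xx - sum_x ** 2 / n
    sxy = sum_xy - sum_x * sum_y / n
    b1 = sxy / sxx                       # equation 1.9
    b0 = sum_y / n - b1 * sum_x / n      # equation 1.7
    return b0, b1


b0, b1 = least_squares_line(X, Y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Rounded to three decimals, the two values agree with equation 1.11.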
TABLE 1.2. Observed values, estimated values, and residuals for the linear
regression of soybean yield on ozone dosage.

    $Y_i$       $\hat{Y}_i$        $e_i$          $e_i^2$

     242        247.563       -5.563          30.947
     237        232.887        4.113          16.917
     231        221.146        9.854          97.101
     201        209.404       -8.404          70.627

                        $\sum e_i$ = 0.0    $\sum e_i^2$ = 215.592




1.3 Predicted Values and Residuals 

The regression equation from Example 1.1 can be evaluated to obtain es- 
timates of the mean of the dependent variable Y at chosen levels of the 
independent variable. Of course, the validity of such estimates is depen- 
dent on the assumed model being correct, or at least a good approximation 
to the correct model within the limits of the pollution doses observed in 
the study. 

Each quantity computed from the fitted regression line $\hat{Y}_i$ is used as both
(1) the estimate of the population mean of $Y$ for that particular value of
$X$ and (2) the prediction of the value of $Y$ one might obtain on some
future observation at that level of $X$. Hence, the $\hat{Y}_i$ are referred to both
as estimates and as predicted values. On occasion we write $\hat{Y}_{\text{pred}_i}$ to
clearly imply the second interpretation.

If the observed values $Y_i$ in the data set are compared with their
corresponding values $\hat{Y}_i$ computed from the regression equation, a measure
of the degree of agreement between the model and the data is obtained.
Remember that the least squares principle makes this agreement as “good
as possible” in the least squares sense. The residuals

$e_i = Y_i - \hat{Y}_i$ (1.13)

measure the discrepancy between the data and the fitted model. The results
for Example 1.1 are shown in Table 1.2. Notice that the residuals sum to
zero, as they always will when the model includes the constant term $\beta_0$.
The least squares estimation procedure has minimized the sum of squares
of the $e_i$. That is, there is no other choice of values for the two parameters
$\beta_0$ and $\beta_1$ that will provide a smaller $\sum e_i^2$.
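The entries of Table 1.2 and the two properties just stated can be checked directly. The sketch below assumes the Table 1.1 data and keeps the estimates at full precision; `ss_res_for` is a hypothetical helper added here to evaluate SS(Res) for any candidate line.

```python
# Ozone dose X (ppm) and mean soybean yield Y (g/plant) from Table 1.1.
X = [0.02, 0.07, 0.11, 0.15]
Y = [242.0, 237.0, 231.0, 201.0]
n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Least squares estimates in deviation form (equation 1.7).
sxx = sum((x - x_bar) ** 2 for x in X)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in X]                 # equation 1.10
residuals = [y - f for y, f in zip(Y, fitted)]    # equation 1.13
ss_res = sum(e * e for e in residuals)


def ss_res_for(c0, c1):
    """SS(Res) for an arbitrary candidate line with intercept c0 and slope c1."""
    return sum((y - (c0 + c1 * x)) ** 2 for x, y in zip(X, Y))


for y, f, e in zip(Y, fitted, residuals):
    print(f"{y:6.0f} {f:9.3f} {e:8.3f} {e * e:8.3f}")
print(f"SS(Res) = {ss_res:.3f}")
```

At full precision SS(Res) comes out near 215.61, the value carried in Table 1.3. The 215.592 shown in Table 1.2 is the sum of squares of residuals that were first rounded to three decimals. Moving either estimate away from its least squares value, for example `ss_res_for(b0 + 1, b1)`, gives a larger sum of squares.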



A plot of the regression equation and the data from Example 1.1
(Figure 1.1) provides a visual check on the arithmetic and the adequacy with
which the equation characterizes the data. The regression line crosses the
$Y$-axis at the value of $\hat\beta_0 = 253.4$. The negative sign on $\hat\beta_1$ is reflected in
the negative slope. Inspection of the plot shows that the regression line
decreases to approximately $Y = 223$ when $X = .1$. This is a decrease of
30 grams of yield over a .1 ppm increase in ozone, or a rate of change
of $-300$ grams in $Y$ for each unit increase in $X$. This is reasonably close
to the computed value of $-293.5$ grams per ppm. Figure 1.1 shows that
the regression line “passes through” the data as well as could be expected
from a straight-line relationship. The pattern of the deviations from the
regression line, however, suggests that the linear model may not adequately
represent the relationship. ■

FIGURE 1.1. Regression of soybean yield on ozone level.



1.4 Analysis of Variation in the Dependent 
Variable 

The residuals are defined in equation 1.13 as the deviations of the observed 
values from the estimated values provided by the regression equation. Al- 
ternatively, each observed value of the dependent variable Yi can be written 
as the sum of the estimated population mean of Y for the given value of 
X and the corresponding residual: 

$Y_i = \hat{Y}_i + e_i$. (1.14)

$\hat{Y}_i$ is the part of the observation $Y_i$ “accounted for” by the model, whereas
$e_i$ reflects the “unaccounted for” part.

The total uncorrected sum of squares of $Y_i$, SS(Total$_{\text{uncorr}}$) =
$\sum Y_i^2$, can be similarly partitioned. Substitute $\hat{Y}_i + e_i$ for each $Y_i$ and
expand the square. Thus,

$\sum Y_i^2 = \sum (\hat{Y}_i + e_i)^2$
$= \sum \hat{Y}_i^2 + \sum e_i^2$
$= \text{SS(Model)} + \text{SS(Res)}$. (1.15)

(The cross-product term $\sum \hat{Y}_i e_i$ is zero, as can readily be shown with the
matrix notation of Chapter 3. Also see Exercise 1.22.) The term SS(Model)
is the sum of squares “accounted for” by the model; SS(Res) is the
“unaccounted for” part of the sum of squares. The forms SS(Model) = $\sum \hat{Y}_i^2$
and SS(Res) = $\sum e_i^2$ show the origins of these sums of squares. The more
convenient computational forms are

$\text{SS(Model)} = n\bar{Y}^2 + \hat\beta_1^2 \sum (X_i - \bar{X})^2$
$\text{SS(Res)} = \text{SS(Total}_{\text{uncorr}}) - \text{SS(Model)}$. (1.16)

The partitioning of the total uncorrected sum of squares can be reexpressed
in terms of the corrected sum of squares by subtracting the sum of
squares due to correction for the mean, the correction factor $n\bar{Y}^2$, from
each side of equation 1.15:

$\text{SS(Total}_{\text{uncorr}}) - n\bar{Y}^2 = [\text{SS(Model)} - n\bar{Y}^2] + \text{SS(Res)}$

or, using equation 1.16,

$\sum y_i^2 = \hat\beta_1^2 \sum (X_i - \bar{X})^2 + \sum e_i^2$
$= \text{SS(Regr)} + \text{SS(Res)}$. (1.17)

Notice that lowercase $y_i$ is the deviation of $Y_i$ from $\bar{Y}$ so that $\sum y_i^2$ is the
corrected total sum of squares. Henceforth, SS(Total) is used to denote
the corrected sum of squares of the dependent variable. Also notice that
SS(Model) denotes the sum of squares attributable to the entire model,
whereas SS(Regr) denotes only that part of SS(Model) that exceeds the
correction factor. The correction factor is the sum of squares for a model
that contains only the constant term $\beta_0$. Such a model postulates that the
mean of $Y$ is a constant, or is unaffected by changes in $X$. Thus, SS(Regr)
measures the additional information provided by the independent variable.

The degrees of freedom associated with each sum of squares is determined
by the sample size $n$ and the number of parameters $p'$ in the model. [We
use $p'$ to denote the number of parameters in the model and $p$ (without
the prime) to denote the number of independent variables; $p' = p + 1$ when
the model includes an intercept as in equation 1.2.] The degrees of freedom
associated with SS(Model) is $p' = 2$; the degrees of freedom associated with
SS(Regr) is always 1 less to account for subtraction of the correction factor,
which has 1 degree of freedom. SS(Res) will contain the $(n - p')$ degrees of
freedom not accounted for by SS(Model). The mean squares are found by
dividing each sum of squares by its degrees of freedom.

TABLE 1.3. Partitions of the degrees of freedom and sums of squares for yield of
soybeans exposed to ozone (courtesy of Dr. A. S. Heagle, N.C. State University).

  Source of                  Degrees of                                                Mean
  Variation                  Freedom         Sum of Squares                            Square

  Total$_{\text{uncorr}}$    $n$      = 4    $\sum Y_i^2$ = 208,495.00
  Corr. factor                          1    $n\bar{Y}^2$ = 207,480.25
  Total$_{\text{corr}}$      $n - 1$  = 3    $\sum y_i^2$ = 1,014.75

  Due to model               $p'$     = 2    $\sum \hat{Y}_i^2$ = 208,279.39
  Corr. factor                          1    207,480.25
  Due to regr.               $p' - 1$ = 1    $\sum \hat{Y}_i^2 - n\bar{Y}^2$ = 799.14        799.14
  Residual                   $n - p'$ = 2    $\sum e_i^2$ = 215.61                     107.81

TABLE 1.4. Analysis of variance of yield of soybeans exposed to ozone pollution
(courtesy of Dr. A. S. Heagle, N.C. State University).

  Source          d.f.        SS           MS

  Total            3        1014.75
  Due to regr.     1         799.14      799.14
  Residual         2         215.61      107.81



The partitions of the degrees of freedom and sums of squares for the ozone 
data from Example 1.1 are given in Table 1.3. The definitional formulae 
for the sums of squares are included. An abbreviated form of Table 1.3, 
omitting the total uncorrected sum of squares, the correction factor, and 
SS(Model), is usually presented as the analysis of variance table (Table 1.4). 
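The partition in Tables 1.3 and 1.4 follows mechanically from equations 1.15 through 1.17. The sketch below recomputes each line from the Table 1.1 data; the variable names are illustrative and not taken from the text.

```python
# Ozone dose X (ppm) and mean soybean yield Y (g/plant) from Table 1.1.
X = [0.02, 0.07, 0.11, 0.15]
Y = [242.0, 237.0, 231.0, 201.0]
n = len(Y)
p_prime = 2  # parameters in the model: intercept and slope

x_bar = sum(X) / n
y_bar = sum(Y) / n
sxx = sum((x - x_bar) ** 2 for x in X)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sxx

ss_total_uncorr = sum(y * y for y in Y)      # uncorrected total sum of squares
corr_factor = n * y_bar ** 2                 # correction for the mean
ss_total = ss_total_uncorr - corr_factor     # corrected total sum of squares
ss_regr = b1 ** 2 * sxx                      # equation 1.17
ss_model = corr_factor + ss_regr             # equation 1.16
ss_res = ss_total_uncorr - ss_model          # equation 1.16

ms_regr = ss_regr / (p_prime - 1)
ms_res = ss_res / (n - p_prime)

# Abbreviated analysis of variance table in the layout of Table 1.4.
print(f"{'Source':<14}{'d.f.':>5}{'SS':>12}{'MS':>10}")
print(f"{'Total':<14}{n - 1:>5}{ss_total:>12.2f}")
print(f"{'Due to regr.':<14}{p_prime - 1:>5}{ss_regr:>12.2f}{ms_regr:>10.2f}")
print(f"{'Residual':<14}{n - p_prime:>5}{ss_res:>12.2f}{ms_res:>10.2f}")
```

The degrees of freedom add in the same way as the sums of squares: $(p' - 1) + (n - p') = n - 1$.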



One measure of the contribution of the independent variable(s) in the 
model is the coefficient of determination, denoted by $R^2$:

$R^2 = \frac{\text{SS(Regr)}}{\sum y_i^2}$. (1.18)



This is the proportion of the (corrected) sum of squares of Y attributable to 
the information obtained from the independent variable(s). The coefficient 
of determination ranges from zero to one and is the square of the product 
moment correlation between $Y_i$ and $\hat{Y}_i$. If there is only one independent
variable, it is also the square of the correlation coefficient between $Y_i$ and
$X_i$.



The coefficient of determination for the ozone data from Example 1.1 is

$R^2 = \frac{799.14}{1,014.75} = .7875$.



The interpretation of $R^2$ is that 79% of the variation in the dependent
variable, yield of soybeans, is “explained” by its linear relationship with
the independent variable, ozone level. Caution must be exercised in the
interpretation given to the phrase “explained by $X$.” In this example, the
data are from a controlled experiment where the level of ozone was being 
controlled in a properly replicated and randomized experiment. It is there- 
fore reasonable to infer that any significant association of the variation in 
yield with variation in the level of ozone reflects a causal effect of the pol- 
lutant. If the data had been observational data, random observations on 
nature as it existed at some point in time and space, there would be no 
basis for inferring causality. Model-fitting can only reflect associations in 
the data. With observational data there are many reasons for associations 
among variables, only one of which is causality. ■ 

If the model is correct, the residual mean square is an unbiased estimate 
of $\sigma^2$, the variance among the random errors. The regression mean square 
is an unbiased estimate of $\sigma^2 + \beta_1^2 \sum x_i^2$, where $\sum x_i^2 = \sum (X_i - \overline{X})^2$. 

These are referred to as the mean square expectations and are denoted 
by $\mathcal{E}[\text{MS(Res)}]$ and $\mathcal{E}[\text{MS(Regr)}]$. Notice that MS(Regr) is estimating the 
same quantity as MS(Res) plus a positive quantity that depends on the 
magnitude of $\beta_1$ and $\sum x_i^2$. Thus, any linear relationship between Y and 
X, where $\beta_1 \neq 0$, will on the average make MS(Regr) larger than MS(Res). 

Comparison of MS(Regr) to MS(Res) provides the basis for judging the 
importance of the relationship. 

The estimate of $\sigma^2$ is denoted by $s^2$. For the data of Example 1.1, 
MS(Res) = $s^2$ = 107.81 (Table 1.4). MS(Regr) = 799.14 is much larger 
than $s^2$, which suggests that $\beta_1$ is not zero. Testing of the null hypothesis 
that $\beta_1 = 0$ is discussed in Section 1.6. ■ 
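
The partition of the sums of squares and the value of $R^2$ can be checked numerically. The following is a minimal Python sketch, not part of the original text. It assumes the four (ozone, yield) pairs of Example 1.1 are (.02, 242), (.07, 237), (.11, 231), and (.15, 201); these values lie outside this section and are taken here as an assumption, although they do reproduce the sums quoted in the text. All variable names are illustrative.

```python
# Ozone data ASSUMED from Example 1.1: ozone level (X) and mean soybean yield (Y).
X = [0.02, 0.07, 0.11, 0.15]
Y = [242.0, 237.0, 231.0, 201.0]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# Corrected sums of squares and products (centered values).
sum_x2 = sum((x - x_bar) ** 2 for x in X)
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
ss_total = sum((y - y_bar) ** 2 for y in Y)

b1 = sum_xy / sum_x2            # least squares slope
ss_regr = b1 ** 2 * sum_x2      # SS(Regr), 1 degree of freedom
ss_res = ss_total - ss_regr     # SS(Res), n - 2 degrees of freedom

ms_regr = ss_regr / 1
ms_res = ss_res / (n - 2)       # s^2, the estimate of sigma^2
r2 = ss_regr / ss_total         # coefficient of determination, equation 1.18

print(f"SS(Regr) = {ss_regr:.2f}, SS(Total) = {ss_total:.2f}")
print(f"MS(Res) = {ms_res:.2f}, R^2 = {r2:.4f}")
```

Under these assumed data the sketch agrees with the values quoted above for SS(Regr), the corrected total sum of squares, MS(Res), and $R^2$ to the precision given in the text.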
1.5 Precision of Estimates 

Any quantity computed from random variables is itself a random variable. 
Thus, $\overline{Y}$, $\hat{Y}$, $e$, $\hat{\beta}_0$, and $\hat{\beta}_1$ are random variables computed from the $Y_i$. 
Measures of precision, variances or standard errors of the estimates, provide 
a basis for judging the reliability of the estimates. 

The computed regression coefficients, the $\hat{Y}_i$, and the residuals are all 
linear functions of the $Y_i$. Their variances can be determined using the 
basic definition of the variance of a linear function. Let $U = \sum a_i Y_i$ be 
an arbitrary linear function of the random variables $Y_i$, where the $a_i$ are 
constants. The general formula for the variance of U is 

$$\text{Var}(U) = \sum a_i^2 \text{Var}(Y_i) + \sum\sum_{i \neq j} a_i a_j \text{Cov}(Y_i, Y_j), \tag{1.19}$$

where the double summation is over all $n(n - 1)$ possible pairs of terms 
where i and j are not equal. Cov(·,·) denotes the covariance between the 
two variables indicated in the parentheses. (Covariance measures the tendency 
of two variables to increase or decrease together.) When the random 
variables are independent, as is assumed in the usual regression model, all 
of the covariances are zero and the double summation term disappears. If, 
in addition, the variances of the random variables are equal, again as in 
the usual regression model where $\text{Var}(Y_i) = \sigma^2$ for all i, the variance of the 
linear function reduces to 

$$\text{Var}(U) = \left(\sum a_i^2\right)\sigma^2. \tag{1.20}$$

Variances of linear functions play an extremely important role in every 
aspect of statistics. Understanding the derivation of variances of linear 
functions will prove valuable; for this reason, we now give several examples. 



The variance of the sample mean of n observations is derived. The coefficient 
$a_i$ on each $Y_i$ in the sample mean is $1/n$. If the $Y_i$ have common 
variance $\sigma^2$ and zero covariances (for example, if they are independent), 
equation 1.20 applies. The sum of squares of the coefficients is 

$$\sum a_i^2 = \sum \left(\frac{1}{n}\right)^2 = \frac{1}{n}$$

and the variance of the mean becomes 

$$\text{Var}(\overline{Y}) = \frac{\sigma^2}{n}. \tag{1.21}$$

In this example, the variance is derived for a linear contrast of three 
treatment means, 

$$C = \overline{Y}_1 + \overline{Y}_2 - 2\overline{Y}_3. \tag{1.22}$$

If each mean is the average of n independent observations from the same 
population, the variance of each sample mean is equal to $\text{Var}(\overline{Y}_i) = \sigma^2/n$ 
and all covariances are zero. The coefficients on the $\overline{Y}_i$ are 1, 1, and -2. 
Thus, 

$$\begin{aligned}
\text{Var}(C) &= (1)^2\text{Var}(\overline{Y}_1) + (1)^2\text{Var}(\overline{Y}_2) + (-2)^2\text{Var}(\overline{Y}_3) \\
&= (1 + 1 + 4)\left(\frac{\sigma^2}{n}\right) = 6\left(\frac{\sigma^2}{n}\right).
\end{aligned} \tag{1.23}$$

■ 
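
The two examples above are both instances of equation 1.19. As an illustration only, the following Python sketch evaluates the variance of a linear function from its coefficients. The function name `var_linear`, the common variance `sigma2 = 4.0`, and the sample size `n = 5` are hypothetical choices made for this sketch and do not come from the text.

```python
def var_linear(coefs, variances, cov=None):
    """Variance of U = sum(a_i * Y_i), following equation 1.19.

    coefs     -- the constants a_i
    variances -- Var(Y_i) for each i
    cov       -- optional function cov(i, j) returning Cov(Y_i, Y_j), i != j;
                 leave as None when the Y_i are uncorrelated
    """
    k = len(coefs)
    total = sum(a * a * v for a, v in zip(coefs, variances))
    if cov is not None:
        # Double summation over all k(k - 1) ordered pairs with i != j.
        total += sum(coefs[i] * coefs[j] * cov(i, j)
                     for i in range(k) for j in range(k) if i != j)
    return total


sigma2 = 4.0   # hypothetical common variance
n = 5          # hypothetical number of observations per mean

# Sample mean: coefficient 1/n on each of n independent observations.
var_mean = var_linear([1 / n] * n, [sigma2] * n)

# Contrast C = Ybar_1 + Ybar_2 - 2*Ybar_3, each mean having variance sigma2/n.
var_contrast = var_linear([1, 1, -2], [sigma2 / n] * 3)

print(var_mean, sigma2 / n)          # both equal sigma^2 / n
print(var_contrast, 6 * sigma2 / n)  # both equal 6 sigma^2 / n
```

Passing a `cov` function covers the correlated case; with independent observations the double summation drops out, which is the simplification used in equations 1.20 through 1.23.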



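We now turn to deriving the variances of $\hat{\beta}_1$, $\hat{\beta}_0$, and $\hat{Y}_i$. To determine 
the variance of $\hat{\beta}_1$ express 

$$\hat{\beta}_1 = \frac{\sum x_i y_i}{\sum x_i^2} \tag{1.24}$$

as 

$$\hat{\beta}_1 = \frac{\sum x_i Y_i}{\sum x_i^2}. \tag{1.25}$$

(See Exercise 1.16 for justification for replacing $y_i$ with $Y_i$.) The coefficient 
on each $Y_i$ is $x_i/\sum x_j^2$, which is a constant in the regression model. The 
$Y_i$ are assumed to be independent and to have common variance $\sigma^2$. Thus, 
the variance of $\hat{\beta}_1$ is 

$$\begin{aligned}
\text{Var}(\hat{\beta}_1) &= \left(\frac{x_1}{\sum x_j^2}\right)^2\sigma^2 + \left(\frac{x_2}{\sum x_j^2}\right)^2\sigma^2 + \cdots + \left(\frac{x_n}{\sum x_j^2}\right)^2\sigma^2 \\
&= \frac{\sum x_i^2}{\left(\sum x_j^2\right)^2}\,\sigma^2 = \frac{\sigma^2}{\sum x_i^2}.
\end{aligned} \tag{1.26}$$

Determining the variance of the intercept 

$$\hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \overline{X} \tag{1.27}$$

is a little more involved. The random variables in this linear function are 
$\overline{Y}$ and $\hat{\beta}_1$; the coefficients are 1 and $(-\overline{X})$. Equation 1.19 can be used to 
obtain the variance of $\hat{\beta}_0$: 

$$\text{Var}(\hat{\beta}_0) = \text{Var}(\overline{Y}) + (-\overline{X})^2\text{Var}(\hat{\beta}_1) + 2(-\overline{X})\text{Cov}(\overline{Y}, \hat{\beta}_1). \tag{1.28}$$

It has been shown that $\text{Var}(\overline{Y}) = \sigma^2/n$ and $\text{Var}(\hat{\beta}_1) = \sigma^2/\sum x_i^2$, but 
$\text{Cov}(\overline{Y}, \hat{\beta}_1)$ remains to be determined. 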

The covariance between two linear functions is only slightly more complicated 
than the variance of a single linear function. Let U be the linear 
function defined earlier with $a_i$ as coefficients and let W be a second linear 
function of the same random variables using $d_i$ as coefficients: 

$$U = \sum a_i Y_i \quad \text{and} \quad W = \sum d_i Y_i.$$

The covariance of U and W is given by 

$$\text{Cov}(U, W) = \sum a_i d_i \text{Var}(Y_i) + \sum\sum_{i \neq j} a_i d_j \text{Cov}(Y_i, Y_j), \tag{1.29}$$

where the double summation is again over all $n(n - 1)$ possible combinations 
of different values of the subscripts. If the $Y_i$ are independent, the 
covariances are zero and equation 1.29 reduces to 

$$\text{Cov}(U, W) = \sum a_i d_i \text{Var}(Y_i). \tag{1.30}$$

Note that products of the corresponding coefficients are being used, whereas 
the squares of the coefficients were used in obtaining the variance of a linear 
function. 

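Returning to the derivation of $\text{Var}(\hat{\beta}_0)$, where U and W are $\overline{Y}$ and $\hat{\beta}_1$, we 
note that the corresponding coefficients for each $Y_i$ are $1/n$ and $x_i/\sum x_j^2$, 
respectively. Thus, the covariance between $\overline{Y}$ and $\hat{\beta}_1$ is 

$$\text{Cov}(\overline{Y}, \hat{\beta}_1) = \sum \left(\frac{1}{n}\right)\left(\frac{x_i}{\sum x_j^2}\right)\text{Var}(Y_i) = \frac{1}{n}\left(\frac{\sum x_i}{\sum x_j^2}\right)\sigma^2 = 0, \tag{1.31}$$

since $\sum x_i = 0$. Thus, the variance of $\hat{\beta}_0$ reduces to 

$$\text{Var}(\hat{\beta}_0) = \text{Var}(\overline{Y}) + (\overline{X})^2\text{Var}(\hat{\beta}_1) = \left(\frac{1}{n} + \frac{\overline{X}^2}{\sum x_i^2}\right)\sigma^2. \tag{1.32}$$

Recall that $\hat{\beta}_0$ is the estimated mean of Y when X = 0, and thus $\text{Var}(\hat{\beta}_0)$ 
can be thought of as the $\text{Var}(\hat{Y})$ for X = 0. The formula for $\text{Var}(\hat{\beta}_0)$ can 
be used to obtain the variance of $\hat{Y}_i$ for any given value of X, by replacing 
$\overline{X}$ with $(X_i - \overline{X})$. Since 

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i = \overline{Y} + \hat{\beta}_1(X_i - \overline{X}), \tag{1.33}$$

we have 

$$\text{Var}(\hat{Y}_i) = \left[\frac{1}{n} + \frac{(X_i - \overline{X})^2}{\sum x_j^2}\right]\sigma^2. \tag{1.34}$$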



The variance of the fitted value attains its minimum of $\sigma^2/n$ when the 
regression equation is being evaluated at $X_i = \overline{X}$, and increases as the 
value of X at which the equation is being evaluated moves away from $\overline{X}$. 
Equation 1.34 gives the appropriate variance when $\hat{Y}_i$ is being used as the 
estimate of the true mean $\beta_0 + \beta_1 X_i$ of Y at the specific value $X_i$ of X. 

Consider the problem of predicting some future observation $Y_0 = \beta_0 + 
\beta_1 X_0 + \epsilon_0$, at a specific value $X_0$ of X, where $\epsilon_0$ is assumed to be $N(0, \sigma^2)$, 
independent of the current observations. Recall that $\hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 X_0$ is 
used as an estimate of the mean $\beta_0 + \beta_1 X_0$ of $Y_0$. Since the best prediction 
for $\epsilon_0$ is its mean zero, $\hat{Y}_0$ is also used as the predictor of $Y_0$. The variance 
for prediction must take into account the fact that the quantity being 
predicted is itself a random variable. The success of the prediction will 
depend on how small the difference is between $\hat{Y}_0$ and the future observation 
$Y_0$. The difference $\hat{Y}_0 - Y_0$ is called the prediction error. The average 
squared difference between $\hat{Y}_0$ and $Y_0$, $\mathcal{E}(\hat{Y}_0 - Y_0)^2$, is called the mean 
squared error of prediction. If the model is correct and prediction is for 
an individual in the same population from which the data were obtained, 
so that $\mathcal{E}(\hat{Y}_0 - Y_0) = 0$, the mean squared error is also the variance of 
prediction. Assuming this to be the case, the variance for prediction 
$\text{Var}(\hat{Y}_{\text{pred}_0})$ is the variance of the difference between $\hat{Y}_0$ and the future 
observation $Y_0$: 

$$\begin{aligned}
\text{Var}(\hat{Y}_{\text{pred}_0}) &= \text{Var}(\hat{Y}_0 - Y_0) \\
&= \text{Var}(\hat{Y}_0) + \sigma^2 \\
&= \left[1 + \frac{1}{n} + \frac{(X_0 - \overline{X})^2}{\sum x_i^2}\right]\sigma^2.
\end{aligned} \tag{1.35}$$



Comparing equation 1.35 with equation 1.34, where $X_0$ is a particular $X_i$, 
we observe that the variance for prediction is the variance for estimation 
of the mean plus the variance of the quantity being predicted. 

The derived variances are the true variances; they depend on knowledge 
of $\sigma^2$. Var(·) and $\sigma^2$ are used to designate true variances. Estimated 
variances are obtained by replacing $\sigma^2$ in the variance equations with an 
estimate of $\sigma^2$. The residual mean square from the analysis provides an 
estimate of $\sigma^2$ if the correct model has been fitted. As shown later, estimates 
of $\sigma^2$ that are not dependent on the correct regression model being used are 
available in some cases. The estimated variances obtained by substituting 
$s^2$ for $\sigma^2$ are denoted by $s^2(\cdot)$, with the quantity in parentheses designating 
the random variable to which the variance applies. 
TABLE 1.5. Summary of important formulae in simple regression. 

Formula                                                                                   Estimate of (or formula for) 

$\hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \overline{X}$                               $\beta_0$ 
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$                                           $\mathcal{E}(Y_i)$ 
$e_i = Y_i - \hat{Y}_i$                                                                   $\epsilon_i$ 
SS(Total uncorr) $= \sum Y_i^2$                                                           Total uncorrected sum of squares 
SS(Total) $= \sum Y_i^2 - (\sum Y_i)^2/n$                                                 Total corrected sum of squares 
SS(Model) $= n\overline{Y}^2 + \hat{\beta}_1^2(\sum x_i^2)$                               Sum of squares due to model 
SS(Regr) $= \hat{\beta}_1^2(\sum x_i^2)$                                                  Sum of squares due to X 
SS(Res) = SS(Total) - SS(Regr)                                                            Residual sum of squares 
$R^2$ = SS(Regr)/SS(Total)                                                                Coefficient of determination 
$s^2(\hat{\beta}_1) = s^2/\sum x_i^2$                                                     Variance of $\hat{\beta}_1$ 
$s^2(\hat{\beta}_0) = [1/n + \overline{X}^2/\sum x_i^2]\, s^2$                            Variance of $\hat{\beta}_0$ 
$s^2(\hat{Y}_i) = [1/n + (X_i - \overline{X})^2/\sum x_i^2]\, s^2$                        Variance of estimated mean at $X_i$ 
$s^2(\hat{Y}_{\text{pred}_0}) = [1 + 1/n + (X_0 - \overline{X})^2/\sum x_i^2]\, s^2$      Variance of prediction at $X_0$ 

Table 1.5 provides a summary to this point of the important formulae in 
linear regression with one independent variable. 



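For the ozone data from Example 1.1, $s^2 = 107.81$, $n = 4$, and $\sum x_i^2 = 
[.0399 - (.35)^2/4] = .009275$. Thus, the estimated variances for the linear 
functions are: 

$$s^2(\hat{\beta}_1) = \frac{s^2}{\sum x_i^2} = \frac{107.81}{.009275} = 11{,}623.281,$$

$$s^2(\hat{\beta}_0) = \left[\frac{1}{n} + \frac{\overline{X}^2}{\sum x_i^2}\right]s^2 = \left[\frac{1}{4} + \frac{(.0875)^2}{.009275}\right](107.81) = 115.942,$$

$$s^2(\hat{Y}_1) = \left[\frac{1}{n} + \frac{(X_1 - \overline{X})^2}{\sum x_i^2}\right]s^2 = \left[\frac{1}{4} + \frac{(.02 - .0875)^2}{.009275}\right](107.81) = 79.91.$$

Making appropriate changes in the values of $X_i$ gives the variances of the 
remaining $\hat{Y}_i$: 

$$s^2(\hat{Y}_2) = 30.51, \quad s^2(\hat{Y}_3) = 32.84, \quad \text{and} \quad s^2(\hat{Y}_4) = 72.35.$$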



Note that $\hat{Y}_1$ may also be used to predict the yield $Y_0$ of a future observation 
at the ozone level $X_0 = X_1 = .02$. The variance for prediction of $Y_0$ would 
be $\text{Var}(\hat{Y}_1)$ increased by the amount $\sigma^2$. Thus, an estimated variance of 
prediction for $Y_0$ is $s^2(\hat{Y}_1) + s^2 = 187.72$. Similarly, the estimated variances 
for predictions of future yields at ozone levels 0.07, 0.11, and 0.15 are 138.32, 
140.65, and 180.16, respectively. ■ 
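
The estimated variances in this example can be reproduced with a few lines of Python. This is a sketch under the same assumption as before: the four (ozone, yield) pairs are taken to be those of Example 1.1, which lies outside this section. The helper names `s2_mean` and `s2_pred` are illustrative. Because the sketch carries $s^2$ at full precision rather than the rounded 107.81, its results agree with the text only to the precision printed there.

```python
# Ozone data ASSUMED from Example 1.1.
X = [0.02, 0.07, 0.11, 0.15]
Y = [242.0, 237.0, 231.0, 201.0]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n
sum_x2 = sum((x - x_bar) ** 2 for x in X)                       # corrected SS of X
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b1 = sum_xy / sum_x2
ss_res = sum((y - y_bar) ** 2 for y in Y) - b1 ** 2 * sum_x2
s2 = ss_res / (n - 2)                                           # residual mean square

s2_b1 = s2 / sum_x2                                             # s^2(beta1-hat)
s2_b0 = (1 / n + x_bar ** 2 / sum_x2) * s2                      # s^2(beta0-hat)


def s2_mean(x0):
    """Estimated variance of the fitted mean at x0 (equation 1.34 with s^2)."""
    return (1 / n + (x0 - x_bar) ** 2 / sum_x2) * s2


def s2_pred(x0):
    """Estimated variance for predicting a future observation at x0 (equation 1.35)."""
    return s2_mean(x0) + s2


print(f"s2(b1) = {s2_b1:.3f}, s2(b0) = {s2_b0:.3f}")
for x0 in X:
    print(f"X = {x0:.2f}: mean {s2_mean(x0):.2f}, prediction {s2_pred(x0):.2f}")
```

The variance of the fitted mean is smallest for the ozone levels nearest $\overline{X} = .0875$, which is the behavior described after equation 1.34.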



1.6 Tests of Significance and Confidence Intervals 

The most common hypothesis of interest in simple linear regression is the 
hypothesis that the true value of the linear regression coefficient, the slope, 
is zero. This says that the dependent variable Y shows neither a linear 
increase nor decrease as the independent variable changes. In some cases, 
the nature of the problem will suggest other values for the null hypothesis. 
The computed regression coefficients, being random variables, will never 
exactly equal the hypothesized value even when the hypothesis is true. 
The role of the test of significance is to protect against being misled by the 
random variation in the estimates. Is the difference between the observed 
value of the estimate $\hat{\beta}_1$ and the hypothesized value of the parameter 
greater than can be reasonably attributed to random variation? If so, the 
null hypothesis is rejected. 

To accommodate the more general case, the null hypothesis is written 
as $H_0: \beta_1 = m$, where m is any constant of interest and of course can be 
equal to zero. The alternative hypothesis is $H_a: \beta_1 \neq m$, $H_a: \beta_1 > m$, 
or $H_a: \beta_1 < m$ depending on the expected behavior of $\beta_1$ if the null 
hypothesis is not true. In the first case, $H_a: \beta_1 \neq m$ is referred to as the 
two-tailed alternative hypothesis (interest is in detecting departures of $\beta_1$ 
from m in either direction) and leads to a two-tailed test of significance. 
The latter two alternative hypotheses, $H_a: \beta_1 > m$ and $H_a: \beta_1 < m$, are 
one-tailed alternatives and lead to one-tailed tests of significance. 

If the random errors in the model, the $\epsilon_i$, are normally distributed, the 
$Y_i$ and any linear function of the $Y_i$ will be normally distributed [see Searle 
(1971)]. Thus, $\hat{\beta}_1$ is normally distributed with mean $\beta_1$ ($\hat{\beta}_1$ is shown to be 
unbiased in Chapter 3) and variance $\text{Var}(\hat{\beta}_1)$. If the null hypothesis that 
$\beta_1 = m$ is true, then $\hat{\beta}_1 - m$ is normally distributed with mean zero. Thus, 

$$t = \frac{\hat{\beta}_1 - m}{s(\hat{\beta}_1)} \tag{1.36}$$

is distributed as Student’s t with degrees of freedom determined by the 
degrees of freedom in the estimate of $\sigma^2$ in the denominator. The computed 
t-value is compared to the appropriate critical value of Student’s t 
(Appendix Table A), determined by the Type I error rate $\alpha$ and whether 
the alternative hypothesis is one-tailed or two-tailed. The critical value of 
Student’s t for the two-tailed alternative hypothesis places probability $\alpha/2$ 
in each tail of the distribution. The critical values for the one-tailed 
alternative hypotheses place probability $\alpha$ in only the upper or lower tail of the 
distribution, depending on whether the alternative is $\beta_1 > m$ or $\beta_1 < m$, 
respectively. 



The estimate of $\beta_1$ for Heagle’s ozone data from Example 1.1 was $\hat{\beta}_1 = 
-293.53$ with a standard error of $s(\hat{\beta}_1) = \sqrt{11{,}623.281} = 107.81$. Thus, the 
computed t-value for the test of $H_0: \beta_1 = 0$ is 

$$t = \frac{-293.53}{107.81} = -2.72.$$



The estimate of $\sigma^2$ in this example has only two degrees of freedom. Using 
the two-tailed alternative hypothesis and $\alpha = .05$ gives a critical t-value of 
$t_{(.025,2)} = 4.303$. Since $|t| < 4.303$, the conclusion is that the data do not 
provide convincing evidence that $\beta_1$ is different from zero. 

In this example one might expect the increasing levels of ozone to depress 
the yield of soybeans; that is, the slope would be negative if not zero. The 
appropriate one-tailed alternative hypothesis would be $H_a: \beta_1 < 0$. For 
this one-tailed test, the critical value of t for $\alpha = .05$ is $t_{(.05,2)} = 2.920$. 
Although the magnitude of the computed t is close to this critical value, 
strict adherence to the a = .05 size of test leads to the conclusion that 
there is insufficient evidence in these data to infer a real (linear) effect of 
ozone on soybean yield. (From a practical point of view, one would begin 
to suspect a real effect of ozone and seek more conclusive data.) ■ 



In a similar manner, t-tests of hypotheses about $\beta_0$ and any of the $\hat{Y}_i$ can 
be constructed. In each case, the numerator of the t-statistic is the difference 
between the estimated value of the parameter and the hypothesized 
value, and the denominator is the standard deviation (or standard error) of 
the estimate. The degrees of freedom for Student’s t is always the degrees 
of freedom associated with the estimate of $\sigma^2$. 

The F-statistic can be used as an alternative to Student’s t for two-tailed 
hypotheses about the regression coefficients. It was indicated earlier that 
MS(Regr) is an estimate of $\sigma^2 + \beta_1^2 \sum x_i^2$ and that MS(Res) is an estimate of 
$\sigma^2$. If the null hypothesis that $\beta_1 = 0$ is true, both MS(Regr) and MS(Res) 
are estimating $\sigma^2$. As $\beta_1$ deviates from zero, MS(Regr) will become increasingly 
larger (on the average) than MS(Res). Therefore, a ratio of MS(Regr) 
to MS(Res) appreciably larger than unity would suggest that $\beta_1$ is not zero. 
This ratio of MS(Regr) to MS(Res) follows the F-distribution when the 
assumption that the residuals are normally distributed is valid and the null 
hypothesis is true. 



For the ozone data of Example 1.1, the ratio of variances is 

$$F = \frac{\text{MS(Regr)}}{\text{MS(Res)}} = \frac{799.14}{107.81} = 7.41.$$

This can be compared to the critical value of the F-distribution with 1 
degree of freedom in the numerator and 2 degrees of freedom in the denominator, 
$F_{(.05,1,2)} = 18.51$ for $\alpha = .05$ (Appendix Table A.3), to determine 
whether MS(Regr) is sufficiently larger than MS(Res) to rule out chance as 
the explanation. Since F = 7.41 < 18.51, the conclusion is that the data do 
not provide conclusive evidence of a linear effect of ozone. The F-ratio with 
1 degree of freedom in the numerator is the square of the corresponding 
t-statistic. Therefore, the F and the t are equivalent tests for this two-tailed 
alternative hypothesis. ■ 
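
The t- and F-tests of this section reduce to a few comparisons. The Python sketch below is illustrative only. It uses the summary values and critical values quoted in the text rather than looking up any distribution, and the variable names are hypothetical.

```python
import math

# Summary values quoted in the text for the ozone example.
b1 = -293.53                      # slope estimate
s2_b1 = 11623.281                 # estimated variance of the slope
ms_regr, ms_res = 799.14, 107.81  # mean squares from Table 1.4

# Critical values quoted in the text (2 residual degrees of freedom).
T_TWO_TAILED = 4.303              # t(.025, 2)
T_ONE_TAILED = 2.920              # t(.05, 2)
F_CRIT = 18.51                    # F(.05, 1, 2)

m = 0.0                           # hypothesized slope under H0
t = (b1 - m) / math.sqrt(s2_b1)   # equation 1.36
f = ms_regr / ms_res

reject_two_tailed = abs(t) > T_TWO_TAILED   # Ha: beta1 != 0
reject_lower_tail = t < -T_ONE_TAILED       # Ha: beta1 < 0
reject_f = f > F_CRIT

print(f"t = {t:.2f}, F = {f:.2f}, t^2 = {t * t:.2f}")
print(reject_two_tailed, reject_lower_tail, reject_f)
```

The printed $t^2$ matches F up to rounding of the inputs, which is the equivalence of the two-tailed t-test and the F-test noted above.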
Confidence interval estimates of parameters are more informative 
than point estimates because they reflect the precision of the estimates. 
The 95% confidence interval estimates of $\beta_1$ and $\beta_0$ are, respectively, 

$$\hat{\beta}_1 \pm t_{(.025,\nu)}\, s(\hat{\beta}_1) \tag{1.37}$$

and 

$$\hat{\beta}_0 \pm t_{(.025,\nu)}\, s(\hat{\beta}_0), \tag{1.38}$$

where $\nu$ is the degrees of freedom associated with $s^2$. 



The 95% confidence interval estimate of $\beta_1$ for Example 1.1 is 

$$-293.53 \pm (4.303)(107.81)$$

or (-757, 170). 

The confidence interval estimate indicates that the true value may fall 
anywhere between -757 and 170. This very wide range conveys a high degree 
of uncertainty (lack of confidence) in the point estimate $\hat{\beta}_1 = -293.53$. 
Notice that the interval includes zero. This is consistent with the conclusions 
from the t-test and the F-test that $H_0: \beta_1 = 0$ cannot be rejected. 
The 95% confidence interval estimate of $\beta_0$ is 

$$253.43 \pm (4.303)(10.77)$$

or (207.1, 299.8). The value of $\beta_0$ might reasonably be expected to fall 
anywhere between 207 and 300 based on the information provided by this 
study. ■ 
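
Equations 1.37 and 1.38 share one form, estimate plus or minus a critical value times a standard error. A minimal Python sketch of that calculation follows, using only the estimates, standard errors, and critical value quoted in the text. The function name `conf_interval` is a hypothetical helper for this illustration.

```python
def conf_interval(estimate, std_error, t_crit):
    """Return (lower, upper) limits: estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


T_CRIT = 4.303  # t(.025, 2), as quoted in the text

ci_b1 = conf_interval(-293.53, 107.81, T_CRIT)  # slope
ci_b0 = conf_interval(253.43, 10.77, T_CRIT)    # intercept

print(f"beta1: ({ci_b1[0]:.0f}, {ci_b1[1]:.0f})")
print(f"beta0: ({ci_b0[0]:.1f}, {ci_b0[1]:.1f})")

# The slope interval covers zero, in agreement with the two-tailed t-test.
print(ci_b1[0] < 0 < ci_b1[1])
```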



In a similar manner, interval estimates of the true mean of Y for various 
values of X are computed using $\hat{Y}_i$ and their standard errors. Frequently, 
these confidence interval estimates of $\mathcal{E}(Y_i)$ are plotted with the regression 
line and the observed data. Such graphs convey an overall picture of how 
well the regression represents the data and the degree of confidence one 
might place in the results. Figure 1.2 shows the results for the ozone exam- 
ple. The confidence coefficient of .95 applies individually to the confidence 
intervals on each estimated mean. Simultaneous confidence intervals are 
discussed in Section 4.6. 

The failure of the tests of significance to detect an effect of ozone on the 
yield of soybeans is, in this case, a reflection of the lack of power in this 
small data set. This lack of power is due primarily to the limited degrees of 
freedom available for estimating $\sigma^2$. In defense of the research project from 
which these data were borrowed, we must point out that only a portion of 
the data (the set of treatment means) is being used for this illustration. The 
complete data set from this experiment provides for an adequate estimate 
of error and shows that the effects of ozone are highly significant. The 
complete data are used at a later time. 



1.7 Regression Through the Origin 

In some situations the regression line is expected to pass through the origin. 
That is, the true mean of the dependent variable is expected to be zero when 
the value of the independent variable is zero. Many growth models, for 
example, would pass through the origin. The amount of chemical produced 
in a system requiring a catalyst would be zero when there is no catalyst 
present. The linear regression model is forced to pass through the origin by 
setting $\beta_0$ equal to zero. The linear model then becomes 

$$Y_i = \beta_1 X_i + \epsilon_i. \tag{1.39}$$



There is now only one parameter to be estimated and application of the 
least squares principle gives 

$$\hat{\beta}_1 \left(\sum X_i^2\right) = \sum X_i Y_i \tag{1.40}$$

as the only normal equation to be solved. The solution is 

$$\hat{\beta}_1 = \frac{\sum X_i Y_i}{\sum X_i^2}. \tag{1.41}$$

Both the numerator and denominator are now uncorrected sums of products 
and squares. The regression equation becomes 

$$\hat{Y}_i = \hat{\beta}_1 X_i, \tag{1.42}$$

and the residuals are defined as before, 

$$e_i = Y_i - \hat{Y}_i. \tag{1.43}$$



Unlike the model with an intercept, in the no-intercept model the sum of 
the residuals is not necessarily zero. 

The uncorrected sum of squares of Y can still be partitioned into the 
two parts 

$$\text{SS(Model)} = \sum \hat{Y}_i^2 \tag{1.44}$$

and 

$$\text{SS(Res)} = \sum (Y_i - \hat{Y}_i)^2 = \sum e_i^2. \tag{1.45}$$

Since only one parameter is involved in determining $\hat{Y}_i$, SS(Model) has only 
1 degree of freedom and cannot be further partitioned into the correction for 
the mean and SS(Regr). For the same reason, the residual sum of squares 
has $(n - 1)$ degrees of freedom. The residual mean square is an estimate of $\sigma^2$ 
if the model is correct. The expectation of the MS(Model) is $\mathcal{E}[\text{MS(Model)}] 
= \sigma^2 + \beta_1^2\left(\sum X_i^2\right)$. This is the same form as $\mathcal{E}[\text{MS(Regr)}]$ for a model with 
an intercept except here the sum of squares for X is the uncorrected sum 
of squares. 

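The variance of $\hat{\beta}_1$ is determined using the rules for the variance of a 
linear function (see equations 1.25 and 1.26). The coefficients on the $Y_i$ 
for the no-intercept model are $X_i/\sum X_j^2$. With the same assumptions of 
independence of the $Y_i$ and common variance $\sigma^2$, the variance of $\hat{\beta}_1$ is 

$$\text{Var}(\hat{\beta}_1) = \sum \left(\frac{X_i}{\sum X_j^2}\right)^2 \sigma^2 = \frac{\sigma^2}{\sum X_i^2}. \tag{1.46}$$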



The divisor on $\sigma^2$, the uncorrected sum of squares for the independent 
variable, will always be larger (usually much larger) than the corrected 
sum of squares. Therefore, the estimate of $\beta_1$ in equation 1.41 will be much 
more precise than the estimate in equation 1.9 when a no-intercept model 
is appropriate. This results because one parameter, $\beta_0$, is assumed to be 
known. 

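The variance of $\hat{Y}_i$ is most easily obtained by viewing it as a linear function 
of $\hat{\beta}_1$: 

$$\hat{Y}_i = X_i \hat{\beta}_1. \tag{1.47}$$

Thus, the variance is 

$$\text{Var}(\hat{Y}_i) = X_i^2\, \text{Var}(\hat{\beta}_1) = \left(\frac{X_i^2}{\sum X_j^2}\right)\sigma^2. \tag{1.48}$$

Estimates of the variances are obtained by substitution of $s^2$ for $\sigma^2$. 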



Regression through the origin is illustrated using data on increased risk 
incurred by individuals exposed to a toxic agent. Such health risks are often 
expressed as relative risk, the ratio of the rate of incidence of the health 
problem for those exposed to the rate of incidence for those not exposed 
to the toxic agent. A relative risk of 1.0 implies no increased risk of the 
disease from exposure to the agent. Table 1.6 gives the relative risk to 
individuals exposed to differing levels of dust in their work environments. 
Dust exposure is measured as the average number of particles/ft³/year 
scaled by dividing by $10^6$. By definition, the expected relative risk is 1.0 
when exposure is zero. Thus, the regression line relating relative risk to 
exposure should have an intercept of 1.0 or, equivalently, the regression 
line relating Y = (relative risk - 1) to exposure should pass through the 
origin. The variable Y and key summary statistics on X and Y are included 
in Table 1.6. 

TABLE 1.6. Relative risk of exposure to dust for nine groups of individuals. Dust 
exposure is reported in particles/ft³/year and scaled by dividing by $10^6$. 

X = Dust Exposure      Relative Risk      Y = Relative Risk - 1 

       75                  1.10                  .10 
      100                  1.05                  .05 
      150                   .97                 -.03 
      350                  1.90                  .90 
      600                  1.83                  .83 
      900                  2.45                 1.45 
    1,300                  3.70                 2.70 
    1,650                  3.52                 2.52 
    2,250                  4.16                 3.16 

$\sum X_i = 7{,}375$                              $\sum Y_i = 11.68$ 
$\sum X_i^2 = 10{,}805{,}625$                     $\sum Y_i^2 = 27.2408$ 
$\sum X_i Y_i = 16{,}904$ 

Assuming a linear relationship and zero intercept, the point estimate of 
the slope $\beta_1$ of the regression line is 

$$\hat{\beta}_1 = \frac{\sum X_i Y_i}{\sum X_i^2} = \frac{16{,}904}{10{,}805{,}625} = .00156.$$

The estimated increase in relative risk is .00156 for each increase in dust 
exposure of 1 million particles per cubic foot per year. The regression equation 
is 

$$\hat{Y}_i = .00156 X_i.$$

When $X_i = 0$, the value of $\hat{Y}_i$ is zero and the regression equation has been 
forced to pass through the origin. 

The regression partitions each observation $Y_i$ into two parts: that accounted 
for by the regression through the origin $\hat{Y}_i$, and the residual or 
deviation from the regression line $e_i$ (Table 1.7). The sum of squares 
attributable to the model, 

$$\text{SS(Model)} = \sum \hat{Y}_i^2 = 26.4441,$$

and the sum of squares of the residuals, 

$$\text{SS(Res)} = \sum e_i^2 = .7967,$$

partition the total uncorrected sum of squares, 

$$\sum Y_i^2 = 27.2408.$$

TABLE 1.7. $Y_i$, $\hat{Y}_i$, and $e_i$ from linear regression through the origin of increase 
in relative risk (Y = relative risk - 1) on exposure level. 

   $Y_i$        $\hat{Y}_i$       $e_i$ 

    .10          .1173          -.0173 
    .05          .1564          -.1064 
   -.03          .2347          -.2647 
    .90          .5475           .3525 
    .83          .9386          -.1086 
   1.45         1.4079           .0421 
   2.70         2.0337           .6663 
   2.52         2.5812          -.0612 
   3.16         3.5198          -.3598 

$\sum Y_i^2 = 27.2408$     $\sum \hat{Y}_i^2 = 26.4441$     $\sum e_i^2 = .7967$ 

TABLE 1.8. Summary analysis of variance for regression through the origin of 
increase in relative risk on level of exposure to dust particles. 

Source           d.f.           SS         MS         $\mathcal{E}$(MS) 

Total uncorr     n = 9          27.2408 
Due to model     p = 1          26.4441    26.4441    $\sigma^2 + \beta_1^2(\sum X_i^2)$ 
Residual         n - p = 8        .7967      .0996    $\sigma^2$ 



In practice, the sum of squares due to the model is more easily computed 
as 

$$\text{SS(Model)} = \hat{\beta}_1^2\left(\sum X_i^2\right) = (.00156437)^2(10{,}805{,}625) = 26.4441.$$

The residual sum of squares is computed by difference. The summary anal- 
ysis of variance, including the mean square expectations, is given in Ta- 
ble 1.8. 

When the no-intercept model is appropriate, MS(Res) is an estimate of 
$\sigma^2$. MS(Model) is an estimate of $\sigma^2$ plus a quantity that is positive if $\beta_1$ is 
not zero. The ratio of the two mean squares provides a test of significance 
for $H_0: \beta_1 = 0$. This is an F-test with one and eight degrees of freedom, 
if the assumption of normality is valid, and is significant beyond $\alpha = .001$. 
There is clear evidence that the linear regression relating increased risk to 
dust exposure is not zero. 
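
The no-intercept fit and its analysis of variance can be reproduced directly from the data of Table 1.6. The following Python sketch is an illustration rather than part of the original text; the variable names are arbitrary. It also shows that the residuals from a regression through the origin need not sum to zero.

```python
# Data from Table 1.6: dust exposure and relative risk for nine groups.
X = [75, 100, 150, 350, 600, 900, 1300, 1650, 2250]
RR = [1.10, 1.05, 0.97, 1.90, 1.83, 2.45, 3.70, 3.52, 4.16]
Y = [rr - 1 for rr in RR]          # increase in relative risk
n = len(X)

# Uncorrected sums of squares and products.
sum_x2 = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_y2 = sum(y * y for y in Y)

b1 = sum_xy / sum_x2               # equation 1.41
ss_model = b1 ** 2 * sum_x2        # 1 degree of freedom
ss_res = sum_y2 - ss_model         # n - 1 degrees of freedom
ms_res = ss_res / (n - 1)
f_ratio = ss_model / ms_res        # F with 1 and n - 1 degrees of freedom

# Residuals from the fit; their sum is not forced to zero without an intercept.
residuals = [y - b1 * x for x, y in zip(X, Y)]
sum_resid = sum(residuals)

print(f"b1 = {b1:.8f}")
print(f"SS(Model) = {ss_model:.4f}, SS(Res) = {ss_res:.4f}, MS(Res) = {ms_res:.4f}")
print(f"F = {f_ratio:.1f}, sum of residuals = {sum_resid:.4f}")
```

The computed sums of squares agree with Table 1.8, and the large F-ratio is consistent with the conclusion that the slope is not zero.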

The estimated variance of $\hat{\beta}_1$ is 

$$s^2(\hat{\beta}_1) = \frac{s^2}{\sum X_i^2} = \frac{.09958533}{10{,}805{,}625} = 92.161 \times 10^{-10}$$

or 

$$s(\hat{\beta}_1) = 9.6 \times 10^{-5} = .000096.$$

Since each $\hat{Y}_i$ is obtained by multiplying $\hat{\beta}_1$ by the appropriate $X_i$, the 
estimated variance of a $\hat{Y}_i$ is 

$$s^2(\hat{Y}_i) = X_i^2[s^2(\hat{\beta}_1)] = (92.161 \times 10^{-10})X_i^2$$

if $\hat{Y}_i$ is being used as an estimate of the true mean of Y for that value of X. 
If $\hat{Y}_i$ is to be used for prediction of a future observation with dust exposure 
$X_i$, then the variance for prediction is 

$$s^2(\hat{Y}_{\text{pred}_i}) = s^2 + s^2(\hat{Y}_i) = .09958 + (92.161 \times 10^{-10})X_i^2.$$

The variances and the standard errors provide measures of precision of 
the estimate and are used to construct tests of hypotheses and confidence 
interval estimates. 
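
As a small illustration of how these variances behave as functions of the exposure level, the Python sketch below evaluates them from the residual mean square and the uncorrected sum of squares quoted in the text. The helper names `s2_mean` and `s2_pred` are hypothetical.

```python
import math

S2 = 0.09958533          # residual mean square from the no-intercept fit
SUM_X2 = 10_805_625      # uncorrected sum of squares of X (Table 1.6)

s2_b1 = S2 / SUM_X2      # estimated variance of the slope
se_b1 = math.sqrt(s2_b1) # its standard error


def s2_mean(x):
    """Estimated variance of the fitted mean at exposure x (equation 1.48 with s^2)."""
    return x ** 2 * s2_b1


def s2_pred(x):
    """Estimated variance for predicting a future observation at exposure x."""
    return S2 + s2_mean(x)


print(f"s2(b1) = {s2_b1:.4e}, s(b1) = {se_b1:.6f}")
for x in (0, 600, 2250):
    print(x, round(s2_mean(x), 5), round(s2_pred(x), 5))
```

The variance of the fitted mean is exactly zero at X = 0 and grows with the square of the exposure, while the prediction variance never falls below $s^2$.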

The data and a plot of the fitted regression line are shown in Figure 1.3. 
The 95% confidence interval estimates of the mean response $\mathcal{E}(Y_i)$ are 
shown as bands on the regression line in the figure. Notice that with regression 
through the origin the confidence bands go to zero as the origin is 
approached. This is consistent with the model assumption that the mean 
of Y is known to be zero when X = 0. Although the fit appears to be 
reasonable, there are suggestions that the model might be improved. The 
three lowest exposures fall below the regression line and very near zero; 
these levels of exposure may not be having as much impact as linear regression 
through the origin would predict. In addition, the largest residual, 
$e_7 = .6663$, is particularly noticeable. It is nearly twice as large as the 
next largest residual and is the source of over half of the residual sum of 
squares (see Table 1.7). This large positive residual and the overall pattern 
of residuals suggest that a curvilinear relationship without the origin being 
forced to be zero would provide a better fit to the data. In practice, such 
alternative models would be tested before this linear no-intercept model 
would be adopted. We forgo testing the need for a curvilinear relationship 
at this time (fitting curvilinear models is discussed in Chapters 3 and 8) 
and continue with this example to illustrate testing the appropriateness of 
the no-intercept model assuming the linear relationship is appropriate. 

FIGURE 1.3. Regression of increase in relative risk on exposure to dust particles 
with the regression forced through the origin. The bands on the regression line 
connect the limits of the 95% confidence interval estimates of the means. 

The test of the assumption that $\beta_0$ is zero is made by temporarily adopting 
a model that allows a nonzero intercept. The estimate obtained for the 
intercept is then used to test the null hypothesis that $\beta_0$ is zero. Including 
an intercept in this example gives $\hat{\beta}_0 = .0360$ with $s(\hat{\beta}_0) = .1688$. (The 
residual mean square from the intercept model is $s^2 = .1131$ with seven 
degrees of freedom.) The t-test for the null hypothesis that $\beta_0$ is zero is 

$$t = \frac{.0360}{.1688} = .213$$

and is not significant; $t_{(.025,7)} = 2.365$. There is no indication in these data 
that the no-intercept model is inappropriate. (Recall that this test has been 
made assuming the linear relationship is appropriate. If the model were 
expanded to allow a curvilinear response, the test of the null hypothesis that 
$\beta_0 = 0$ might become significant.) An equivalent test of the null hypothesis 
that $\beta_0 = 0$ can be made using the difference between the residual sums of 
squares from the intercept and no-intercept models. This test is discussed 
in Chapter 4. ■ 
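
The intercept test just described can be checked by fitting the model with an intercept to the data of Table 1.6 and applying the formulae of Table 1.5. The Python sketch below is an illustration only; the variable names are arbitrary, and the critical value is the one quoted in the text.

```python
import math

# Data from Table 1.6: dust exposure and Y = relative risk - 1.
X = [75, 100, 150, 350, 600, 900, 1300, 1650, 2250]
Y = [0.10, 0.05, -0.03, 0.90, 0.83, 1.45, 2.70, 2.52, 3.16]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n
sum_x2 = sum((x - x_bar) ** 2 for x in X)                      # corrected SS of X
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b1 = sum_xy / sum_x2
b0 = y_bar - b1 * x_bar                                        # intercept estimate

ss_res = sum((y - y_bar) ** 2 for y in Y) - b1 ** 2 * sum_x2
s2 = ss_res / (n - 2)                                          # 7 degrees of freedom
se_b0 = math.sqrt((1 / n + x_bar ** 2 / sum_x2) * s2)          # s(beta0-hat)

t = b0 / se_b0                                                 # test of H0: beta0 = 0
T_CRIT = 2.365                                                 # t(.025, 7), from the text

print(f"b0 = {b0:.4f}, s(b0) = {se_b0:.4f}, s2 = {s2:.4f}")
print(f"t = {t:.3f}, significant: {abs(t) > T_CRIT}")
```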



1.8 Models with Several Independent Variables 

Most models will use more than one independent variable to explain the 
behavior of the dependent variable. The linear additive model can be ex- 
tended to include any number of independent variables: 



$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \cdots + \beta_p X_{ip} + \epsilon_i. \tag{1.49}$$

The subscript notation has been extended to include a number on each X 
and $\beta$ to identify each independent variable and its regression coefficient. 
There are p independent variables and, including $\beta_0$, $p' = p + 1$ parameters 
to be estimated. 

The usual least squares assumptions apply. The $\epsilon_i$ are assumed to be 
independent and to have common variance $\sigma^2$. For constructing tests of 
significance or confidence interval statements, the random errors are also 
assumed to be normally distributed. The independent variables are as- 
sumed to be measured without error. 

The least squares method of estimation applied to this model requires 
that estimates of the p + 1 parameters be found such that 

$$\begin{aligned}
\text{SS(Res)} &= \sum (Y_i - \hat{Y}_i)^2 \\
&= \sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{i1} - \hat{\beta}_2 X_{i2} - \cdots - \hat{\beta}_p X_{ip})^2
\end{aligned} \tag{1.50}$$

is minimized. The $\hat{\beta}_j$, $j = 0, 1, \ldots, p$, are the estimates of the parameters. 
The values of $\hat{\beta}_j$ that minimize SS(Res) are obtained by setting the derivative 
of SS(Res) with respect to each $\hat{\beta}_j$ in turn equal to zero. This gives 
(p + 1) normal equations that must be solved simultaneously to obtain the 
least squares estimates of the (p + 1) parameters. 

It is apparent that the problem is becoming increasingly difficult as the 
number of independent variables increases. The algebraic notation becomes 
particularly cumbersome. For these reasons, matrix notation and matrix 
algebra are used to develop the regression results for the more complicated 
models. The next chapter is devoted to a brief review of the key matrix 
operations needed for the remainder of the text. 
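
To make the idea of solving (p + 1) normal equations simultaneously concrete, the following Python sketch builds the normal equations for a model with two independent variables and solves them by Gaussian elimination. It is an illustration only: the data are hypothetical, generated without error from Y = 1 + 2X₁ + 3X₂ so that the correct answer is known, and the function names are not from the text. The matrix formulation it anticipates is developed in later chapters.

```python
def normal_equations(rows, y):
    """Build the (p + 1) normal equations for a model with an intercept.

    rows -- list of tuples (X_i1, ..., X_ip); y -- list of responses Y_i.
    Returns the coefficient matrix and right-hand side of the equations.
    """
    design = [[1.0] + list(r) for r in rows]       # prepend the intercept column
    k = len(design[0])
    lhs = [[sum(d[a] * d[b] for d in design) for b in range(k)] for a in range(k)]
    rhs = [sum(d[a] * yi for d, yi in zip(design, y)) for a in range(k)]
    return lhs, rhs


def solve(a, b):
    """Solve the linear system a * beta = b by Gaussian elimination with pivoting."""
    k = len(b)
    m = [row[:] + [v] for row, v in zip(a, b)]     # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * k
    for r in reversed(range(k)):                   # back substitution
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, k))
        beta[r] = (m[r][k] - tail) / m[r][r]
    return beta


# Hypothetical, error-free data from Y = 1 + 2*X1 + 3*X2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

beta_hat = solve(*normal_equations(rows, y))
print([round(b, 6) for b in beta_hat])
```

Because the hypothetical data contain no error, the least squares estimates recover the generating coefficients; with real data the same equations give the values that minimize SS(Res).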



1.9 Violation of Assumptions 

In Section 1.1, we assumed that 



$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i = 1, \ldots, n,$$

where the random errors $\epsilon_i$ are normally distributed independent random 
variables with mean zero and constant variance $\sigma^2$, and the $X_i$ are n observations 
on the independent variable that is measured without error. 
Under these assumptions, the least squares estimators of $\beta_0$ and $\beta_1$ are the 
best (minimum variance) among all possible unbiased estimators. Statis- 
tical inference procedures, such as hypothesis testing and confidence and 
prediction intervals, considered in the previous section are valid under these 
assumptions. Here we briefly indicate the effects of violation of assumptions 
on estimation and statistical inference. A more detailed discussion of prob- 
lem areas in least squares and possible remedies is presented in Chapters 
10 through 14. 

Major problem areas in least squares analysis relate to failure of the four 
basic assumptions — normality, independence and constant variance of the 
errors, and the independent variable being measured without error. When 
only the assumption of normality is violated, the least squares estimators 
continue to have the smallest variance among all linear (in Y) unbiased 
estimators. The assumption of normality is not needed for the partitioning 
of total variation or for estimating the variance. However, it is needed for 
tests of significance and construction of confidence and prediction inter- 
vals. Although normality is a reasonable assumption in many situations, 
it is not appropriate for count data and for some time-to-failure data that 
tend to have asymmetric distributions. Transformations of the dependent 
variable and alternative estimation procedures are used in such situations. 
Also, in many situations with large n, statistical inference procedures based 
on t- and F-statistics are approximately valid, even though the normality
assumption is not valid. 



When data are collected in a time sequence, the errors associated with
an observation at one point in time will tend to be correlated with the 
errors of the immediately adjacent observations. Economic and meteoro- 
logical variables measured over time and repeated measurements over time 
on the same experimental unit, such as in plant and animal growth stud- 
ies, will usually have correlated errors. When the errors are correlated, the 
least squares estimators continue to be unbiased, but are no longer the best 
estimators. Also, in this case, the variance estimators obtained using equa- 
tions 1.26 and 1.32 are seriously biased. Alternative estimation methods 
for correlated errors are discussed in Chapter 12. 

In some situations, the variability in the errors increases with the inde- 
pendent variable or with the mean of the response variable. For example, 
in some yield data, the mean and the variance of the yield both increase 
with the amount of seeds (or fertilizer) used. Consider the model 

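$$\begin{aligned}
Y_i &= (\beta_0 + \beta_1 X_i) u_i \\
    &= \beta_0 + \beta_1 X_i + (\beta_0 + \beta_1 X_i)(u_i - 1) \\
    &= \beta_0 + \beta_1 X_i + \epsilon_i,
\end{aligned}$$

where the errors $u_i$ are multiplicative and have mean one and constant
variance. Then the variance of $\epsilon_i$ is proportional to $(\beta_0 + \beta_1 X_i)^2$. The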
effect of nonconstant (heterogeneous) variances on least squares estimators 
is similar to that of correlated errors. The least squares estimators are no 
longer efficient and the variance formulae in equations 1.26 and 1.32 are 
not valid. Alternative methods are discussed in Chapter 11. 
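
The claim that the variance of the additive error grows with the square of the mean under a multiplicative error model can be checked by simulation. The sketch below is illustrative only: the parameter values, the normal distribution chosen for the multiplicative errors, and the function name are all assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 2.0, 0.5   # hypothetical parameter values
tau = 0.2                 # assumed standard deviation of the u_i
n_draws = 200_000


def simulated_error_variance(x):
    """Sample variance of eps = (beta0 + beta1*x)(u - 1) at a fixed x."""
    # Multiplicative errors with mean one and constant variance tau**2.
    u = rng.normal(loc=1.0, scale=tau, size=n_draws)
    eps = (beta0 + beta1 * x) * (u - 1.0)
    return eps.var()


for x in (1.0, 5.0, 10.0):
    mean_y = beta0 + beta1 * x
    theory = mean_y ** 2 * tau ** 2   # (beta0 + beta1*x)^2 * Var(u)
    print(x, simulated_error_variance(x), theory)
```

The simulated variances should track the theoretical values closely, so that the ratio of the variances at two values of X is approximately the squared ratio of the corresponding means.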

When the independent variable is measured with error or when the model 
is misspecified by omitting important independent variables, least squares 
estimators will be biased. In such cases, the variance estimators are also 
biased. Methods for detecting model misspecification and estimation in 
measurement error models are discussed in later chapters. Also, the effect 
of overly influential data points and outliers is discussed later. 



1.10 Summary 

This chapter has reviewed the basic elements of least squares estimation 
for the simple linear model containing one independent variable. The more 
complicated linear model with several independent variables was introduced 
and is pursued using matrix notation in subsequent chapters. The student 
should understand these concepts: 

• the form and basic assumptions of the linear model; 

• the least squares criterion, the estimators of the parameters obtained 
using this criterion, and measures of precision of the estimates; 



• the use of the regression equation to obtain estimates of mean values
and predictions, and appropriate measures of precision for each; and 

• the partitioning of the total variability of the response variable into 
that explained by the regression equation and the residual or unex- 
plained part. 



1.11 Exercises 

1.1. Use the least squares criterion to derive the normal equations, equa- 
tion 1.6, for the simple linear model of equation 1.2. 

1.2. Solve the normal equations, equation 1.6, to obtain the estimates of 
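$\beta_0$ and $\beta_1$ given in equation 1.7.

1.3. Use the statistical model

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

to show that $\epsilon_i \sim \text{NID}(0, \sigma^2)$ implies each of the following:

(a) $\mathcal{E}(Y_i) = \beta_0 + \beta_1 X_i$,

(b) $\sigma^2(Y_i) = \sigma^2$, and

(c) $\text{Cov}(Y_i, Y_{i'}) = 0$, $i \neq i'$.

For Parts (b) and (c), use the following definitions of variance and
covariance.

$$\sigma^2(Y_i) = \mathcal{E}\{[Y_i - \mathcal{E}(Y_i)]^2\}$$

$$\text{Cov}(Y_i, Y_{i'}) = \mathcal{E}\{[Y_i - \mathcal{E}(Y_i)][Y_{i'} - \mathcal{E}(Y_{i'})]\}.$$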

1.4. The data in the accompanying table relate heart rate at rest Y to 
kilograms body weight X. 





      X       Y
     90      62
     86      45
     67      40
     89      55
     81      64
     75      53

$\sum X_i = 488$        $\sum Y_i = 319$
$\sum X_i^2 = 40{,}092$    $\sum Y_i^2 = 17{,}399$
$\sum X_i Y_i = 26{,}184$

(a) Graph these data. Does it appear that there is a linear relationship
between body weight and heart rate at rest?

(b) Compute $\hat{\beta}_0$ and $\hat{\beta}_1$ and write the regression equation for these
data. Plot the regression line on the graph from Part (a). Interpret
the estimated regression coefficients.

(c) Now examine the data point (67, 40). If this data point were
removed from the data set, what changes would occur in the
estimates of $\beta_0$ and $\beta_1$?

(d) Obtain the point estimate of the mean of Y when X = 88.
Obtain a 95% confidence interval estimate of the mean of Y
when X = 88. Interpret this interval statement.

(e) Predict the heart rate for a particular subject weighing 88 kg
using both a point prediction and a 95% confidence interval.
Compare these predictions to the estimates computed in Part
(d).

(f) Without doing the computations, for which measured X would
the corresponding $\hat{Y}$ have the smallest variance? Why?

1.5. Use the data and regression equation from Exercise 1.4 and compute
$\hat{Y}_i$ for each value of X. Compute the product moment correlations
between

(a) $X_i$ and $Y_i$,

(b) $Y_i$ and $\hat{Y}_i$, and

(c) $X_i$ and $\hat{Y}_i$.

Compare these correlations to each other and to the coefficient of
determination $R^2$. Can you prove algebraically the relationships you
detect?

1.6. Show that 

$$SS(\text{Model}) = n\bar{Y}^2 + \hat{\beta}_1^2 \sum (X_i - \bar{X})^2 \quad \text{(equation 1.16)}.$$

1.7. Show that

$$\sum (Y_i - \hat{Y}_i)^2 = \sum y_i^2 - \hat{\beta}_1^2 \sum (X_i - \bar{X})^2.$$

Note that $\sum y_i^2$ is being used to denote the corrected sum of squares.

1.8. Show algebraically that $\sum e_i = 0$ when the simple linear regression
equation includes the constant term $\beta_0$. Show algebraically that this
is not true when the simple linear regression does not include the 
intercept. 

1.9. The following data relate biomass production of soybeans to cumu- 
lative intercepted solar radiation over an eight-week period following 
emergence. Biomass production is the mean dry weight in grams of 
independent samples of four plants. (Data courtesy of Virginia Lesser 
and Dr. Mike Unsworth, North Carolina State University.) 



        X                  Y
  Solar Radiation    Plant Biomass
       29.7               16.6
       68.4               49.1
      120.7              121.7
      217.2              219.6
      313.5              375.5
      419.1              570.8
      535.9              648.2
      641.5              755.6



(a) Compute $\hat{\beta}_0$ and $\hat{\beta}_1$ for the linear regression of plant biomass on
intercepted solar radiation. Write the regression equation.

(b) Place 95% confidence intervals on $\beta_1$ and $\beta_0$. Interpret the
intervals.

(c) Test $H_0: \beta_1 = 1.0$ versus $H_a: \beta_1 \neq 1.0$ using a t-test with
$\alpha = .1$. Is your result for the t-test consistent with the confidence
interval from Part (b)? Explain.

(d) Use a t-test to test $H_0: \beta_0 = 0$ against $H_a: \beta_0 \neq 0$. Interpret
the results. Now fit a regression with $\beta_0 = 0$. Give the analysis of
variance for the regression through the origin and use an F-test
to test $H_0: \beta_0 = 0$. Compare the results of the t-test and the
F-test. Do you adopt the model with or without the intercept?

(e) Compute $s^2(\hat{\beta}_1)$ for the regression equation without an
intercept. Compare the variances of the estimates of the slopes $\hat{\beta}_1$
for the two models. Which model provides the greater precision
for the estimate of the slope?

(f) Compute the 95% confidence interval estimates of the mean
biomass production for X = 30 and X = 600 for both the
intercept and the no-intercept models. Explain the differences
in the intervals obtained for the two models.

1.10. A linear regression was run on a set of data using an intercept and one
independent variable. You are given only the following information: 

(1) $\hat{Y}_i = 11.5 - 1.5 X_i$.

(2) The t-test for $H_0: \beta_1 = 0$ was nonsignificant at the $\alpha = .05$
level. A computed t of -4.087 was compared to $t_{(.05, 2)}$ from
Appendix Table A.1.

(3) The estimate of $\sigma^2$ was $s^2 = 1.75$.

(a) Complete the analysis of variance table using the given results.

(b) Compute and interpret the coefficient of determination $R^2$.

1.11. An experiment has yielded sample means for four treatment regimes,
$\bar{Y}_1$, $\bar{Y}_2$, $\bar{Y}_3$, and $\bar{Y}_4$. The numbers of observations in the four means
are $n_1 = 4$, $n_2 = 6$, $n_3 = 3$, and $n_4 = 9$. The pooled estimate of $\sigma^2$ is
$s^2 = 23.5$.

(a) Compute the variance of each treatment mean.

(b) Compute the variance of the mean contrast $C = \bar{Y}_3 + \bar{Y}_4 - 2\bar{Y}_1$.

(c) Compute the variance of $(\bar{Y}_1 + \bar{Y}_2 + \bar{Y}_3)/3$.

(d) Compute the variance of $(4\bar{Y}_1 + 6\bar{Y}_2 + 3\bar{Y}_3)/13$.

1.12. Obtain the normal equations and the least squares estimates for the 
model 

$$Y_i = \mu + \beta_1 x_i + \epsilon_i,$$

where $x_i = (X_i - \bar{X})$. Compare the results to equation 1.6. (The
model expressed in this form is referred to as the “centered” model;
the independent variable has been shifted to have mean zero.)

1.13. Recompute the regression equation and analysis of variance for the
Heagle ozone data (Table 1.1) using the centered model,

$$Y_i = \mu + \beta_1 x_i + \epsilon_i,$$

where $x_i = (X_i - \bar{X})$. Compare the results with those in Tables 1.2
to 1.4.

1.14. Derive the normal equation for the no-intercept model, equation 1.40,
and the least squares estimate of the slope, equation 1.41.

1.15. Derive the variance of $\hat{\beta}_1$ and $\hat{Y}_i$ for the no-intercept model.

1.16. Show that

$$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum (X_i - \bar{X}) Y_i = \sum X_i (Y_i - \bar{Y}).$$

1.17. The variance of $\hat{Y}_{\text{pred}_0}$ as given by equation 1.35 is for the prediction
of a single future observation. Derive the variance of a prediction of 
the mean of q future observations all having the same value of X. 

1.18. An experimenter wants to design an experiment for estimating the 
rate of change in a dependent variable Y as an independent variable 
X is changed. He is convinced from previous experience that the 
relationship is linear in the region of interest, between X = 0 and 
X = 11. He has enough resources to obtain 12 observations. Use 
$\sigma^2(\hat{\beta}_1)$, equation 1.26, to show the researcher the best allocation of the
design points (choices of X-values). Compare $\sigma^2(\hat{\beta}_1)$ for this optimum
allocation with an allocation of one observation at each integer value
of X from X = 0 to X = 11.

1.19. The data in the table relate seed weight of soybeans, collected for 
six successive weeks following the start of the reproductive stage, to 
cumulative seasonal solar radiation for two levels of chronic ozone 
exposure. Seed weight is mean seed weight (grams per plant) from 
independent samples of four plants. (Data courtesy of Virginia Lesser 
and Dr. Mike Unsworth.) 



        Low Ozone                    High Ozone
  Radiation   Seed Weight      Radiation   Seed Weight
    118.4         .7             109.1        1.3
    215.2        2.9             199.6        4.8
    283.9        5.6             264.2        6.5
    387.9        8.7             358.2        9.4
    451.5       12.4             413.2       12.9
    515.6       17.4             452.5       12.3



(a) Determine the linear regression of seed weight on radiation sep- 
arately for each level of ozone. Determine the similarity of the 
two regressions by comparing the confidence interval estimates 
of the two intercepts and the two slopes and by visual inspection 
of plots of the data and the regressions. 

(b) Regardless of your conclusion in Part (a), assume that the two 
regressions are the same and estimate the common regression 
equation. 

1.20. A hotel experienced an outbreak of Pseudomona dermatis among its 
guests. Physicians suspected the source of infection to be the hotel 
whirlpool-spa. The data in the table give the number of female guests 
and the number infected by categories of time (minutes) spent in the 
whirlpool.

     Time        Number       Number
   (Minutes)    of Guests    Infected
     0-10           8            1
    11-20          12            3
    21-30           9            3
    31-40          14            7
    41-50           7            4
    51-60           4            3
    61-70           2            2



(a) Can the incidence of infection (number infected/number ex- 
posed) be characterized by a linear regression on time spent 
in the whirlpool? Use the midpoint of the time interval as the 
independent variable. Estimate the intercept and the slope, and 
plot the regression line and the data. 

(b) Review each of the basic assumptions of least squares regression 
and comment on whether each is satisfied by these data. 

1.21. Hospital records were examined to assess the link between smoking 
and duration of illness. The data reported in the table are the number 
of hospital days (per 1,000 person-years) for several classes of indi- 
viduals, the average number of cigarettes smoked per day, and the 
number of hospital days for control groups of nonsmokers for each 
class. (The control groups consist of individuals matched as nearly as 
possible to the smokers for several primary health factors other than 
smoking.) 



   # Hospital        # Cigarettes      # Hospital
  Days (Smokers)     Smoked/Day     Days (Nonsmokers)
       215               10               201
       185                5               180
       334               15               297
       761               45               235
       684               25               520
       368               30               210
      1275               50               195
      3190               45               835
      3520               60               435
       428               20               312
       575                5               590
      2280               45              1131
      2795               60               225

(a) Plot the logarithm of number of hospital days (for the smokers)
against number of cigarettes. Do you think a linear regression 
will adequately represent the relationship? 

(b) Plot the logarithm of number of hospital days for smokers minus 
the logarithm of number of hospital days for the control group 
against number of cigarettes. Do you think a linear regression 
will adequately represent the relationship? Has subtraction of 
the control group means reduced the dispersion? 

(c) Define Y = ln(# days for smokers) - ln(# days for nonsmokers)
and X = (# cigarettes)$^2$. Fit the linear regression of Y on X.
Make a test of significance to determine if the intercept can 
be set to zero. Depending on your results, give the regression 
equation, the standard errors of the estimates, and the summary 
analysis of variance. 

1.22. Use the normal equations in 1.6 to show that 

(a) $\sum X_i Y_i = \sum X_i \hat{Y}_i$.

(b) $\sum X_i e_i = 0$.

(c) $\sum \hat{Y}_i e_i = 0$. (Hint: use Exercise 1.8.)

1.23. Consider the regression through the origin model in equation 1.39.

Suppose $X_i > 0$. Define $\tilde{\beta}_1 = \sum Y_i / \sum X_i$ and $\hat{\beta}_1 = \sum X_i Y_i / \sum X_i^2$.

(a) Show that $\tilde{\beta}_1$ and $\hat{\beta}_1$ are unbiased for $\beta_1$.

(b) Compare the variances of $\tilde{\beta}_1$ and $\hat{\beta}_1$.

INTRODUCTION TO MATRICES



Chapter 1 reviewed simple linear regression in alge- 
braic notation and showed that the notation for models 
involving several variables is very cumbersome. 

This chapter introduces matrix notation and all matrix 
operations that are used in this text. Matrix algebra 
greatly simplifies the presentation of regression and is 
used throughout the text. Sections 2.7 and 2.8 are not 
used until later in the text and can be omitted for now. 



Matrix algebra is extremely helpful in multiple regression for simplify- 
ing notation and algebraic manipulations. You must be familiar with the 
basic operations of matrices in order to understand the regression results 
presented. A brief introduction to the key matrix operations is given in 
this chapter. You are referred to matrix algebra texts, for example, Searle 
(1982), Searle and Hausman (1970), or Stewart (1973), for more complete 
presentations of matrix algebra. 



2.1 Basic Definitions 

A matrix is a rectangular array of numbers arranged in orderly rows and 
columns. Matrices are denoted with boldface capital letters. The following
are examples.

$$\mathbf{Z} = \begin{bmatrix} 1 & 2 \\ 6 & 4 \\ 5 & 7 \end{bmatrix}, \qquad
\mathbf{B} = \begin{bmatrix} 15 & 7 & -1 & 0 \\ 15 & 5 & -2 & 10 \end{bmatrix},$$

X = 



5- 

6 

4 

9 

2 

6 . 



The numbers that form a matrix are called the elements of the matrix. A 
general matrix could be denoted as 



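$$\mathbf{A} = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots &        & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}.$$

The subscripts on the elements denote the row and column, respectively,
in which the element appears. For example, $a_{23}$ is the element found in the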
second row and third column. The row number is always given first. 

The order of a matrix is its size given by the number of rows and 
columns. The first matrix given, Z , is of order (3, 2). That is, Z is a 
3x2 matrix, since it has three rows and two columns. Matrix A is an 
m x n matrix. 

The rank of a matrix is defined as the number of linearly independent 
columns (or rows) in the matrix. Any subset of columns of a matrix are 
linearly independent if no column in the subset can be expressed as a 
linear combination of the others in the subset. The matrix 



$$\mathbf{A} = \begin{bmatrix} 1 & 2 & 4 \\ 3 & 0 & 6 \\ 5 & 3 & 13 \end{bmatrix}$$

contains a linear dependency among its columns. The first column multiplied
by two and added to the second column produces the third column.
In fact, any one of the three columns of A can be written as a linear
combination of the other two columns. On the other hand, any two columns of
A are linearly independent since one cannot be produced as a multiple of
the other. Thus, the rank of the matrix A, denoted by r(A), is two.

If there are no linear dependencies among the columns of a matrix, the 
matrix is said to be of full rank, or nonsingular. If a matrix is not of 
full rank it is said to be singular. The number of linearly independent 
rows of a matrix will always equal the number of linearly independent 
columns. The linear dependency among the rows of A is shown by 9(row 1) +
7(row 2) = 6(row 3). The critical matrices in regression will almost always
have fewer columns than rows and, therefore, rank is more easily visualized
by inspection of the columns. 
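
The rank and the two linear dependencies described for the matrix A can be confirmed numerically. This is a small sketch that assumes NumPy is available; `np.linalg.matrix_rank` determines the rank from the singular values rather than by inspection.

```python
import numpy as np

A = np.array([[1, 2, 4],
              [3, 0, 6],
              [5, 3, 13]])

rank = np.linalg.matrix_rank(A)
print("r(A) =", rank)

# Column dependency: 2*(column 1) + (column 2) = (column 3).
col_check = 2 * A[:, 0] + A[:, 1]
print(np.array_equal(col_check, A[:, 2]))

# Row dependency: 9*(row 1) + 7*(row 2) = 6*(row 3).
row_check = 9 * A[0, :] + 7 * A[1, :]
print(np.array_equal(row_check, 6 * A[2, :]))
```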

The collection of all linear combinations of columns of A is called the 
column space of A or the space spanned by the columns of A. 



2.2 Special Types of Matrices 



A vector is a matrix having only one row or one column, and is called a 
row or column vector, respectively. Although vectors are often designated 
with boldface lowercase letters, this convention is not followed rigorously in 
this text. A boldface capital letter is used to designate a data vector and a 
boldface Greek letter is used for vectors of parameters. Thus, for example, 

/T. , , , 

v = is a 4 x 1 column vector. 

VJ 

H = ( pi P 2 P 3 ) is a 1 x 3 row vector. 



We usually define the vectors as column vectors but they need not be. A 
single number such as 4, —2.1, or 0 is called a scalar. 

A square matrix has an equal number of rows and columns. 

$$\mathbf{D} = \begin{bmatrix} 2 & 4 \\ 6 & 7 \end{bmatrix}$$

is a 2 x 2 square matrix.



A diagonal matrix is a square matrix in which all elements are zero
except the elements on the main diagonal, the diagonal of elements, $a_{11}, a_{22},
\ldots, a_{nn}$, running from the upper left position to the lower right position.

$$\mathbf{A} = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 8 \end{bmatrix}$$

is a 3 x 3 diagonal matrix.



An identity matrix is a diagonal matrix having all the diagonal elements
equal to 1; such a matrix is denoted by $\mathbf{I}_n$. The subscript identifies
the order of the matrix and is omitted when the order is clear from the
context.

$$\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

is a 3 x 3 identity matrix.



After matrix multiplication is discussed, it can be verified that multiplying 
any matrix by the identity matrix will not change the original matrix. 



A symmetric matrix is a square matrix in which element $a_{ij}$ equals
element $a_{ji}$ for all i and j. The elements form a symmetric pattern around
the diagonal of the matrix. 



A 




is a 3 x 3 symmetric matrix. 



Note that the first row is identical to the first column, the second row is 
identical to the second column, and so on. 



2.3 Matrix Operations 



The transpose of a matrix A, designated A', is the matrix obtained by 
using the rows of A as the columns of A' . If 



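$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 8 \\ 4 & 1 \\ 5 & 9 \end{bmatrix},$$

the transpose of A is

$$\mathbf{A}' = \begin{bmatrix} 1 & 3 & 4 & 5 \\ 2 & 8 & 1 & 9 \end{bmatrix}.$$

If a matrix A has order m x n, its transpose A' has order n x m. A symmetric
matrix is equal to its transpose: A' = A.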

Addition of two matrices is defined if and only if the matrices are of 
the same order. Then, addition (or subtraction) consists of adding (or sub- 
tracting) the corresponding elements of the two matrices. For example, 



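$$\begin{bmatrix} 1 & 2 \\ 3 & 8 \end{bmatrix} + \begin{bmatrix} 7 & -6 \\ 8 & 2 \end{bmatrix}
= \begin{bmatrix} 8 & -4 \\ 11 & 10 \end{bmatrix}.$$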



Addition is commutative: A + B = B + A. 

Multiplication of two matrices is defined if and only if the number of 
columns in the first matrix equals the number of rows in the second matrix. 
If A is of order r x s and B is of order m x n, the matrix product AB exists
only if s = m. The matrix product BA exists only if r = n. Multiplication
is most easily defined by first considering the multiplication of a row vector
times a column vector. Let $\mathbf{a}' = (a_1 \ a_2 \ a_3)$ and $\mathbf{b}' = (b_1 \ b_2 \ b_3)$.
(Notice that both a and b are defined as column vectors.) Then, the product
of a' and b is

$$\mathbf{a}'\mathbf{b} = (a_1 \ a_2 \ a_3) \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}
= a_1 b_1 + a_2 b_2 + a_3 b_3.$$ (2.1)

The result is a scalar equal to the sum of products of the corresponding
elements. Let 



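$$\mathbf{a}' = (3 \ 6 \ 1) \quad \text{and} \quad \mathbf{b}' = (2 \ 4 \ 8).$$

The matrix product is

$$\mathbf{a}'\mathbf{b} = (3 \ 6 \ 1) \begin{pmatrix} 2 \\ 4 \\ 8 \end{pmatrix} = 6 + 24 + 8 = 38.$$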



Matrix multiplication is defined as a sequence of vector multiplications. 
Write 



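$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}
\quad \text{as} \quad \mathbf{A} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{pmatrix},$$

where $\mathbf{a}_1' = (a_{11} \ a_{12} \ a_{13})$ and $\mathbf{a}_2' = (a_{21} \ a_{22} \ a_{23})$ are the 1 x 3 row
vectors in A. Similarly, write

$$\mathbf{B} = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\ b_{31} & b_{32} \end{bmatrix}
\quad \text{as} \quad \mathbf{B} = (\mathbf{b}_1 \ \ \mathbf{b}_2),$$

where $\mathbf{b}_1$ and $\mathbf{b}_2$ are the 3 x 1 column vectors in B. Then the product of
A and B is the 2 x 2 matrix

$$\mathbf{AB} = \mathbf{C} = \begin{bmatrix} \mathbf{a}_1'\mathbf{b}_1 & \mathbf{a}_1'\mathbf{b}_2 \\ \mathbf{a}_2'\mathbf{b}_1 & \mathbf{a}_2'\mathbf{b}_2 \end{bmatrix}
= \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix},$$ (2.2)

where

$$\begin{aligned}
c_{11} &= \mathbf{a}_1'\mathbf{b}_1 = \sum_{j=1}^{3} a_{1j} b_{j1} = a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31} \\
c_{12} &= \mathbf{a}_1'\mathbf{b}_2 = \sum_{j=1}^{3} a_{1j} b_{j2} = a_{11}b_{12} + a_{12}b_{22} + a_{13}b_{32} \\
c_{21} &= \mathbf{a}_2'\mathbf{b}_1 = \sum_{j=1}^{3} a_{2j} b_{j1} = a_{21}b_{11} + a_{22}b_{21} + a_{23}b_{31} \\
c_{22} &= \mathbf{a}_2'\mathbf{b}_2 = \sum_{j=1}^{3} a_{2j} b_{j2} = a_{21}b_{12} + a_{22}b_{22} + a_{23}b_{32}.
\end{aligned}$$

In general, element $c_{ij}$ is obtained from the vector multiplication of the
ith row vector from the first matrix and the jth column vector from the
second matrix. The resulting matrix C has the number of rows equal to
the number of rows in A and number of columns equal to the number of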
columns in B. 



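Example 2.1. Let

$$\mathbf{T} = \begin{bmatrix} 1 & 2 \\ 4 & 5 \\ 3 & 0 \end{bmatrix}
\quad \text{and} \quad \mathbf{W} = \begin{pmatrix} -1 \\ 3 \end{pmatrix}.$$

The product WT is not defined since the number of columns in W is not
equal to the number of rows in T. The product TW, however, is defined:

$$\mathbf{TW} = \begin{pmatrix} (1)(-1) + (2)(3) \\ (4)(-1) + (5)(3) \\ (3)(-1) + (0)(3) \end{pmatrix}
= \begin{pmatrix} 5 \\ 11 \\ -3 \end{pmatrix}.$$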

The resulting matrix is of order 3x1 with the elements being determined 
by multiplication of the corresponding row vector from T with the column 
vector in W . ■ 
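
Example 2.1 can be reproduced in a few lines. The sketch below assumes NumPy, whose `@` operator performs matrix multiplication and raises a `ValueError` when the orders do not conform.

```python
import numpy as np

T = np.array([[1, 2],
              [4, 5],
              [3, 0]])      # order 3 x 2
W = np.array([[-1],
              [3]])         # order 2 x 1

# TW is defined: T has 2 columns and W has 2 rows.
TW = T @ W
print(TW.ravel())           # elements 5, 11, -3

# WT is not defined: W has 1 column but T has 3 rows.
try:
    W @ T
    wt_defined = True
except ValueError:
    wt_defined = False
print("WT defined:", wt_defined)
```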



Matrix multiplication is not commutative; AB does not necessarily equal 
BA even if both products exist. As for the matrices W and T in Example 
2.1, the matrices are not of the proper order for multiplication to be defined 
in both ways. The first step in matrix multiplication is to verify that the 
matrices do conform (have the proper order) for multiplication. 

The transpose of a product is equal to the product in reverse order of 
the transposes of the two matrices. That is, 

(AB)' = B'A'. (2.3) 



The transpose of the product of T and W from Example 2.1 is 



$$(\mathbf{TW})' = \mathbf{W}'\mathbf{T}' = (-1 \ \ 3) \begin{bmatrix} 1 & 4 & 3 \\ 2 & 5 & 0 \end{bmatrix}
= (5 \ \ 11 \ \ {-3}).$$
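
The reversal rule of equation 2.3 is easy to check numerically for the matrices of Example 2.1. This is only an illustrative sketch assuming NumPy; the variable names `left` and `right` are introduced here for the two sides of the identity.

```python
import numpy as np

T = np.array([[1, 2],
              [4, 5],
              [3, 0]])
W = np.array([[-1],
              [3]])

left = (T @ W).T     # (TW)'
right = W.T @ T.T    # W'T', the transposes multiplied in reverse order

print(left)
print(right)
print(np.array_equal(left, right))
```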



Scalar multiplication is the multiplication of a matrix by a single 
number. Every element in the matrix is multiplied by the scalar. Thus, 



$$3 \begin{bmatrix} 2 & 1 & 7 \\ 3 & 5 & 9 \end{bmatrix}
= \begin{bmatrix} 6 & 3 & 21 \\ 9 & 15 & 27 \end{bmatrix}.$$

The determinant of a matrix is a scalar computed from the elements of
the matrix according to well-defined rules. Determinants are defined only
for square matrices and are denoted by |A|, where A is a square matrix. 
The determinant of a 1 x 1 matrix is the scalar itself. The determinant of 
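a 2 x 2 matrix,

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},$$

is defined as

$$|\mathbf{A}| = a_{11}a_{22} - a_{12}a_{21}.$$ (2.4)

For example, if

$$\mathbf{A} = \begin{bmatrix} 1 & 6 \\ -2 & 10 \end{bmatrix},$$

the determinant of A is

$$|\mathbf{A}| = (1)(10) - (6)(-2) = 22.$$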

The determinants of higher-order matrices are obtained by expanding 
the determinants as linear functions of determinants of 2 x 2 submatrices. 
First, it is convenient to define the minor and the cofactor of an element 
in a matrix. Let A be a square matrix of order n. For any element $a_{rs}$
in A, a square matrix of order (n - 1) is formed by eliminating the row
and column containing the element $a_{rs}$. Label this matrix $\mathbf{A}_{rs}$, with the
subscripts designating the row and column eliminated from A. Then $|\mathbf{A}_{rs}|$,
the determinant of $\mathbf{A}_{rs}$, is called the minor of the element $a_{rs}$. The product
$\theta_{rs} = (-1)^{r+s}|\mathbf{A}_{rs}|$ is called the cofactor of $a_{rs}$. Each element in a square
matrix has its own minor and cofactor.

The determinant of a matrix of order n is expressed in terms of the elements
of any row or column and their cofactors. Using row i for illustration,
we can express the determinant of A as

$$|\mathbf{A}| = \sum_{j=1}^{n} a_{ij}\theta_{ij},$$ (2.5)

where each $\theta_{ij}$ contains a determinant of order (n - 1). Thus, the determinant
of order n is expanded as a function of determinants of one less
order. Each of these determinants, in turn, is expanded as a linear function 
of determinants of order (n — 2). This substitution of determinants of one 
less order continues until | A| is expressed in terms of determinants of 2 x 2 
submatrices of A. 

The first step of the expansion is illustrated for a 3 x 3 matrix A. To 
compute the determinant of A, choose any row or column of the matrix. 
For each element of the row or column chosen, compute the cofactor of the 
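element. Then, if the ith row of A is used for the expansion,

$$|\mathbf{A}| = a_{i1}\theta_{i1} + a_{i2}\theta_{i2} + a_{i3}\theta_{i3}.$$ (2.6)

Example 2.2. For illustration, let

$$\mathbf{A} = \begin{bmatrix} 2 & 4 & 6 \\ 1 & 2 & 3 \\ 5 & 7 & 9 \end{bmatrix}$$

and use the first row for the expansion of |A|. The cofactors of the elements
in the first row are

$$\begin{aligned}
\theta_{11} &= (-1)^{(1+1)} \begin{vmatrix} 2 & 3 \\ 7 & 9 \end{vmatrix} = (18 - 21) = -3, \\
\theta_{12} &= (-1)^{(1+2)} \begin{vmatrix} 1 & 3 \\ 5 & 9 \end{vmatrix} = -(9 - 15) = 6, \text{ and} \\
\theta_{13} &= (-1)^{(1+3)} \begin{vmatrix} 1 & 2 \\ 5 & 7 \end{vmatrix} = (7 - 10) = -3.
\end{aligned}$$

Then, the determinant of A is

$$|\mathbf{A}| = 2(-3) + 4(6) + 6(-3) = 0.$$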


If the determinant of a matrix is zero, the matrix is singular, or it is 
not of full rank. Otherwise, the matrix is nonsingular. Thus, the matrix 
A in Example 2.2 is singular. The linear dependency is seen by noting that 
row 1 is equal to twice row 2. The determinants of larger matrices rapidly 
become difficult to compute and are obtained with the help of a computer. 
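
The cofactor expansion described above translates directly into a short recursive routine. The following is a minimal sketch in plain Python for small integer matrices; the function names `minor_matrix` and `determinant` are introduced here for illustration, and the approach is far too slow for the large matrices that are normally handled by library routines.

```python
def minor_matrix(a, r, s):
    """Return the matrix `a` with row r and column s removed (0-based)."""
    return [row[:s] + row[s + 1:] for i, row in enumerate(a) if i != r]


def determinant(a):
    """Determinant by cofactor expansion along the first row."""
    n = len(a)
    if n == 1:
        return a[0][0]
    if n == 2:
        return a[0][0] * a[1][1] - a[0][1] * a[1][0]
    total = 0
    for s in range(n):
        # With 0-based indices the sign (-1)^(r+s) for the first row is (-1)^s.
        cofactor = (-1) ** s * determinant(minor_matrix(a, 0, s))
        total += a[0][s] * cofactor
    return total


A = [[2, 4, 6],
     [1, 2, 3],
     [5, 7, 9]]

first_row_cofactors = [(-1) ** s * determinant(minor_matrix(A, 0, s))
                       for s in range(3)]
print(first_row_cofactors)   # cofactors of the first row
print(determinant(A))        # zero, so A is singular
```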

Division in the usual sense does not exist in matrix algebra. The concept 
is replaced by multiplication by the inverse of the matrix. The inverse of 
a matrix A, designated by $\mathbf{A}^{-1}$, is defined as the matrix that gives the
identity matrix when multiplied by A. That is,

$$\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}.$$ (2.7)

The inverse of a matrix may not exist. A matrix has a unique inverse if 
and only if the matrix is square and nonsingular. A matrix is nonsingular 
if and only if its determinant is not zero. 

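The inverse of a 2 x 2 matrix is easily computed. If

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},$$

then

$$\mathbf{A}^{-1} = \frac{1}{|\mathbf{A}|} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}.$$ (2.8)

Note the rearrangement of the elements and the use of the determinant of
A as the scalar divisor. For example, if

$$\mathbf{A} = \begin{bmatrix} 2 & 3 \\ 1 & 4 \end{bmatrix}, \quad \text{then} \quad
\mathbf{A}^{-1} = \begin{bmatrix} \frac{4}{5} & -\frac{3}{5} \\ -\frac{1}{5} & \frac{2}{5} \end{bmatrix}.$$

That this is the inverse of A is verified by multiplication of A and $\mathbf{A}^{-1}$:

$$\mathbf{A}\mathbf{A}^{-1} = \begin{bmatrix} 2 & 3 \\ 1 & 4 \end{bmatrix}
\begin{bmatrix} \frac{4}{5} & -\frac{3}{5} \\ -\frac{1}{5} & \frac{2}{5} \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$$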







The inverse of a matrix is obtained in general by (1) replacing every 
element of the matrix with its cofactor, (2) transposing the resulting matrix, 
and (3) dividing by the determinant of the original matrix, as illustrated 
in the next example. 



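Example 2.3. Consider the following matrix,

$$\mathbf{B} = \begin{bmatrix} 1 & 3 & 2 \\ 4 & 5 & 6 \\ 8 & 7 & 9 \end{bmatrix}.$$

The determinant of B is

$$\begin{aligned}
|\mathbf{B}| &= 1 \begin{vmatrix} 5 & 6 \\ 7 & 9 \end{vmatrix}
 - 3 \begin{vmatrix} 4 & 6 \\ 8 & 9 \end{vmatrix}
 + 2 \begin{vmatrix} 4 & 5 \\ 8 & 7 \end{vmatrix} \\
&= (45 - 42) - 3(36 - 48) + 2(28 - 40) \\
&= 15.
\end{aligned}$$

The unique inverse of B exists since $|\mathbf{B}| \neq 0$. The cofactors for the elements
of the first row of B were used in obtaining |B|: $\theta_{11} = 3$, $\theta_{12} = 12$, $\theta_{13} =
-12$. The remaining cofactors are: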



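$$\begin{aligned}
\theta_{21} &= -\begin{vmatrix} 3 & 2 \\ 7 & 9 \end{vmatrix} = -13, &
\theta_{22} &= \begin{vmatrix} 1 & 2 \\ 8 & 9 \end{vmatrix} = -7, &
\theta_{23} &= -\begin{vmatrix} 1 & 3 \\ 8 & 7 \end{vmatrix} = 17, \\
\theta_{31} &= \begin{vmatrix} 3 & 2 \\ 5 & 6 \end{vmatrix} = 8, &
\theta_{32} &= -\begin{vmatrix} 1 & 2 \\ 4 & 6 \end{vmatrix} = 2, &
\theta_{33} &= \begin{vmatrix} 1 & 3 \\ 4 & 5 \end{vmatrix} = -7.
\end{aligned}$$

Thus, the matrix of cofactors is

$$\begin{bmatrix} 3 & 12 & -12 \\ -13 & -7 & 17 \\ 8 & 2 & -7 \end{bmatrix}$$

and the inverse of B is

$$\mathbf{B}^{-1} = \frac{1}{15} \begin{bmatrix} 3 & -13 & 8 \\ 12 & -7 & 2 \\ -12 & 17 & -7 \end{bmatrix}.$$

Notice that the matrix of cofactors has been transposed and divided by
|B| to obtain $\mathbf{B}^{-1}$. It is left as an exercise to verify that this is the inverse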
of B. As with the determinants, computers are used to find the inverses of 
larger matrices. ■ 
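
The three-step recipe (cofactors, transpose, divide by the determinant) can be sketched in code and checked against Example 2.3. The fragment below assumes NumPy; the helper name `cofactor_matrix` is introduced here for illustration, and in practice a library routine such as `np.linalg.inv` would be used instead.

```python
import numpy as np


def cofactor_matrix(a):
    """Matrix of cofactors theta_rs = (-1)^(r+s) |A_rs| of a square array."""
    n = a.shape[0]
    theta = np.zeros((n, n))
    for r in range(n):
        for s in range(n):
            # Delete row r and column s, then take the determinant (the minor).
            sub = np.delete(np.delete(a, r, axis=0), s, axis=1)
            theta[r, s] = (-1) ** (r + s) * np.linalg.det(sub)
    return theta


B = np.array([[1.0, 3.0, 2.0],
              [4.0, 5.0, 6.0],
              [8.0, 7.0, 9.0]])

theta = cofactor_matrix(B)      # step 1: cofactors
det_B = np.linalg.det(B)
B_inv = theta.T / det_B         # steps 2 and 3: transpose, divide by |B|

print(np.round(theta))
print(round(det_B, 6))
print(np.allclose(B @ B_inv, np.eye(3)))
```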




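Note that if A is a diagonal nonsingular matrix, then $\mathbf{A}^{-1}$ is also a
diagonal matrix where the diagonal elements of $\mathbf{A}^{-1}$ are the reciprocals of
the diagonal elements of A. That is, if

$$\mathbf{A} = \begin{bmatrix}
a_{11} & 0 & 0 & \cdots & 0 \\
0 & a_{22} & 0 & \cdots & 0 \\
0 & 0 & a_{33} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & a_{nn}
\end{bmatrix},$$

where $a_{ii} \neq 0$, then

$$\mathbf{A}^{-1} = \begin{bmatrix}
a_{11}^{-1} & 0 & 0 & \cdots & 0 \\
0 & a_{22}^{-1} & 0 & \cdots & 0 \\
0 & 0 & a_{33}^{-1} & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & a_{nn}^{-1}
\end{bmatrix}.$$

Also, if A and B are two nonsingular matrices, then

$$\begin{bmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0} & \mathbf{B} \end{bmatrix}^{-1}
= \begin{bmatrix} \mathbf{A}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{B}^{-1} \end{bmatrix}.$$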



2.4 Geometric Interpretations of Vectors 

The elements of an n x 1 vector can be thought of as the coordinates of a 
point in an n-dimensional coordinate system. The vector is represented in 
this n-space as the directional line connecting the origin of the coordinate 
system to the point specified by the elements. The direction of the vector 
is from the origin to the point; an arrowhead at the terminus indicates 
direction. 

To illustrate, let $x' = (3 \quad 2)$. This vector is of order two and is plotted
in two-dimensional space as the line vector going from the origin (0, 0) to
the point (3, 2) (see Figure 2.1). This can be viewed as the hypotenuse of a
right triangle whose sides are of length 3 and 2, the elements of the vector
x. The length of x is then given by the Pythagorean theorem as the square
root of the sum of squares of the elements of x. Thus,

$$\text{length}(x) = \sqrt{3^2 + 2^2} = \sqrt{13} = 3.61.$$

FIGURE 2.1. The geometric representation of the vectors $x' = (3, 2)$ and
$w' = (2, -1)$ in two-dimensional space.

This result extends to the length of any vector regardless of its order.
The sum of squares of the elements in a column vector x is given by (the
matrix multiplication) x'x. Thus, the length of any vector x is

$$\text{length}(x) = \sqrt{x'x}. \tag{2.9}$$

Multiplication of x by a scalar defines another vector that falls precisely
on the line formed by extending the vector x indefinitely in both directions.
For example,

$$u' = (-1)x' = (-3 \quad -2)$$

falls on the extension of x in the negative direction. Any point on this
indefinite extension of x in both directions can be “reached” by multiplication
of x with an appropriate scalar. This set of points constitutes the space
defined by x, or the space spanned by x. It is a one-dimensional subspace
of the two-dimensional space in which the vectors are plotted. A single
vector of order n defines a one-dimensional subspace of the n-dimensional
space in which the vector falls.

FIGURE 2.2. Geometric representation of the sum of two vectors.

The second vector $w' = (2 \quad -1)$, shown in Figure 2.1 with a dotted
line, defines another one-dimensional subspace. The two subspaces defined
by x and w are disjoint subspaces (except for the common origin). The 
two vectors are said to be linearly independent since neither falls in 
the subspace defined by the other. This implies that one vector cannot be 
obtained by multiplication of the other vector by a scalar. 

If the two vectors are considered jointly, any point in the plane can be
“reached” by an appropriate linear combination of the two vectors. For
example, the sum of the two vectors gives the vector y (see Figure 2.2),

$$y' = x' + w' = (3 \quad 2) + (2 \quad -1) = (5 \quad 1).$$

The two vectors x and w define, or span, the two-dimensional subspace 
represented by the plane in Figure 2.2. Any third vector of order 2 in this 
two-dimensional space must be a linear combination of x and w. That is, 
there must be a linear dependency among any three vectors that fall on 
this plane. 

Geometrically, the vector x is added to w by moving x, while maintaining 
its direction, until the base of x rests on the terminus of w. The resultant 
vector y is the vector from the origin (0, 0) to the new terminus of x. The 
same result is obtained by moving w along the vector x. This is equivalent
to completing the parallelogram using the two original vectors as adjacent
sides. The sum y is the diagonal of the parallelogram running from the
origin to the opposite corner (see Figure 2.2). Subtraction of two vectors,
say w' - x', is most easily viewed as the addition of w' and (-x').

Vectors of order 3 are considered briefly to show the more general be- 
havior. Each vector of order 3 can be plotted in three-dimensional space; 
the elements of the vector define the endpoint of the vector. Each vector 
individually defines a one-dimensional subspace of the three-dimensional 
space. This subspace is formed by extending the vector indefinitely in both 
directions. Any two vectors define a two-dimensional subspace if the two 
vectors are linearly independent — that is, as long as the two vectors do 
not define the same subspace. The two-dimensional subspace defined by 
two vectors is the set of points in the plane defined by the origin and the 
endpoints of the two vectors. The two vectors defining the subspace and 
any linear combination of them lie in this plane. 

A three-dimensional space contains an infinity of two-dimensional sub- 
spaces. These can be visualized by rotating the plane around the origin. 
Any third vector that does not fall in the original plane will, in conjunction 
with either of the first two vectors, define another plane. Any three linearly 
independent vectors, or any two planes, completely define, or span, the 
three-dimensional space. Any fourth vector in that three-dimensional sub- 
space must be a linear function of the first three vectors. That is, any four 
vectors in a three-dimensional subspace must contain a linear dependency. 

The general results are stated in the box: 



1. Any vector of order n can be plotted in n-dimensional space and 
defines a one-dimensional subspace of the n-dimensional space. 

2. Any p linearly independent vectors of order n, p < n, define a p- 
dimensional subspace. 

3. Any p + 1 vectors in a p-dimensional subspace must contain a linear 
dependency. 
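These statements can be checked numerically for the vectors x and w used
above. The sketch assumes NumPy; the target vector `v` is a hypothetical
point in the plane, introduced only to show that two linearly independent
vectors of order 2 can reach any such point.

```python
import numpy as np

x = np.array([3.0, 2.0])
w = np.array([2.0, -1.0])
y = x + w                                   # the sum shown in Figure 2.2

# Two linearly independent vectors of order 2 span the plane (rank 2).
rank_xw = np.linalg.matrix_rank(np.column_stack([x, w]))

# Adding a third vector in the same plane cannot raise the rank, so the
# three vectors must contain a linear dependency.
rank_xwy = np.linalg.matrix_rank(np.column_stack([x, w, y]))

# Any point in the plane is a linear combination of x and w.
v = np.array([7.0, -4.0])                   # hypothetical target point
coef = np.linalg.solve(np.column_stack([x, w]), v)

print(rank_xw, rank_xwy)
print(coef, coef[0] * x + coef[1] * w)
```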



Two vectors x and w of the same order are orthogonal vectors if the
vector product

$$x'w = w'x = 0. \tag{2.10}$$

If

$$x = \begin{pmatrix} 1 \\ 0 \\ -1 \\ 4 \end{pmatrix}
\quad \text{and} \quad
w = \begin{pmatrix} 3 \\ 4 \\ -1 \\ -1 \end{pmatrix},$$

then x and w are orthogonal because

$$x'w = (1)(3) + (0)(4) + (-1)(-1) + (4)(-1) = 0.$$

Geometrically, two orthogonal vectors are perpendicular to each other or
they form a right angle at the origin.

Two linearly dependent vectors form angles of 0 or 180 degrees at the
origin. All other angles reflect vectors that are neither orthogonal nor
linearly dependent. In general, the cosine of the angle $\alpha$ between two (column)
vectors x and w is

$$\cos(\alpha) = \frac{x'w}{\sqrt{x'x}\,\sqrt{w'w}}. \tag{2.11}$$

If the elements of each vector have mean zero, the cosine of the angle 
formed by two vectors is the product moment correlation between the 
two columns of data in the vectors. Thus, orthogonality of two such vectors 
corresponds to a zero correlation between the elements in the two vectors. If 
two such vectors are linearly dependent, the correlation coefficient between 
the elements of the two vectors will be either +1.0 or —1.0 depending on 
whether the vectors have the same or opposite directions. 
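The length, orthogonality, and cosine relationships can be illustrated with a
short sketch, assuming NumPy. The helper names `length` and `cos_angle` and
the two data columns `a` and `b` are hypothetical; `np.corrcoef` is used only
as an independent check that the cosine of centered vectors matches the
product moment correlation.

```python
import numpy as np


def length(v):
    """Length of a vector: the square root of v'v."""
    return np.sqrt(v @ v)


def cos_angle(x, w):
    """Cosine of the angle between x and w: x'w / (sqrt(x'x) sqrt(w'w))."""
    return (x @ w) / (length(x) * length(w))


# The orthogonal pair from the text: x'w = 0.
x = np.array([1.0, 0.0, -1.0, 4.0])
w = np.array([3.0, 4.0, -1.0, -1.0])
dot_xw = x @ w

# Hypothetical data columns, centered to have mean zero.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
a_c = a - a.mean()
b_c = b - b.mean()

cos_ab = cos_angle(a_c, b_c)
corr_ab = np.corrcoef(a, b)[0, 1]

print(dot_xw)
print(cos_ab, corr_ab)
```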



2.5 Linear Equations and Solutions 



A set of r linear equations in s unknowns is represented in matrix notation
as Ax = y, where x is a vector of the s unknowns, A is the r × s matrix of
known coefficients on the s unknowns, and y is the r × 1 vector of known
constants on the right-hand side of the equations.

A set of equations may have (1) no solution, (2) a unique solution, or (3) 
an infinite number of solutions. In order to have at least one solution, the 
equations must be consistent. This means that any linear dependencies 
among the rows of A must also exist among the corresponding elements of 
y (Searle and Hausman, 1970). For example, the equations

$$\begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 3 & 3 \end{bmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
= \begin{pmatrix} 6 \\ 10 \\ 9 \end{pmatrix}$$

are inconsistent since the second row of A is twice the first row but
the second element of y is not twice the first element. Since they are not
consistent, there is no solution to this set of equations. Note that
$x' = (1 \quad 1 \quad 1)$ satisfies the first and third equations but not the
second. If the second element of y were 12 instead of 10, the equations would
be consistent and the solution $x' = (1 \quad 1 \quad 1)$ would satisfy all three
equations.

One method of determining if a set of equations is consistent is to compare
the rank of A to the rank of the augmented matrix [A y]. The equations are
consistent if and only if

$$r(A) = r([A \;\; y]). \tag{2.12}$$

Rank can be determined by using elementary (row and column) operations
to reduce the elements below the diagonal to zero. Operations such as
addition of two rows, interchanging rows, and obtaining a scalar multiple
of a row are called elementary row operations. (In a rectangular matrix,
the diagonal is defined as the elements $a_{11}, a_{22}, \ldots, a_{dd}$, where d is the
lesser of the number of rows and number of columns.) The number of rows
with at least one nonzero element after reduction is the rank of the matrix.



Example 2.4

Elementary operations on

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 3 & 3 \end{bmatrix}$$

give

$$A^* = \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \\ 0 & 0 & 0 \end{bmatrix}$$

so that r(A) = 2. [The elementary operations to obtain A* are (1) subtract
2 times row 1 from row 2, (2) subtract 3 times row 1 from row 3,
and (3) interchange rows 2 and 3.] The same elementary operations, plus
interchanging columns 3 and 4, on the augmented matrix

$$[A \;\; y] = \begin{bmatrix} 1 & 2 & 3 & 6 \\ 2 & 4 & 6 & 10 \\ 3 & 3 & 3 & 9 \end{bmatrix}$$

give

$$[A \;\; y]^* = \begin{bmatrix} 1 & 2 & 6 & 3 \\ 0 & -3 & -9 & -6 \\ 0 & 0 & -2 & 0 \end{bmatrix}.$$

Thus, r([A y]) = 3. Since $r([A \;\; y]) \neq r(A)$, the equations are not consistent
and, therefore, they have no solution. ■
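The rank comparison can be sketched with NumPy's `matrix_rank`, which
determines rank from singular values rather than by elementary operations, so
it is a substitute technique that gives the same answer here. The function
name `is_consistent` and the tolerance `1e-10` are assumptions made for this
illustration.

```python
import numpy as np


def is_consistent(A, y, tol=1e-10):
    """Return True when r(A) equals r([A y])."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    augmented = np.hstack([A, y])
    return np.linalg.matrix_rank(A, tol=tol) == np.linalg.matrix_rank(augmented, tol=tol)


A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 3.0, 3.0]])
y_bad = np.array([6.0, 10.0, 9.0])     # second element is not twice the first
y_good = np.array([6.0, 12.0, 9.0])    # second element is twice the first

rank_A = np.linalg.matrix_rank(A, tol=1e-10)
rank_aug = np.linalg.matrix_rank(np.hstack([A, y_bad.reshape(-1, 1)]), tol=1e-10)

print(rank_A, rank_aug)
print(is_consistent(A, y_bad), is_consistent(A, y_good))
```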



Consistent equations either have a unique solution or an infinity of
solutions. If r(A) equals the number of unknowns, the solution is unique and
is given by

1. $x = A^{-1}y$, when A is square; or

2. $x = A_1^{-1}y_1$, where $A_1$ is a full-rank square submatrix of A and $y_1$
contains the corresponding elements of y, when A is rectangular.



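Example 2.5

The equations Ax = y with

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 3 \\ 5 & 7 \end{bmatrix}
\quad \text{and} \quad
y = \begin{pmatrix} 6 \\ 9 \\ 21 \end{pmatrix}$$

are consistent. (Proof of consistency is left as an exercise.) The rank of A
equals the number of unknowns [r(A) = 2], so that the solution is unique.
Any two linearly independent equations in the system of equations can be
used to obtain the solution. Using the first two rows gives the full-rank
equations

$$\begin{bmatrix} 1 & 2 \\ 3 & 3 \end{bmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{pmatrix} 6 \\ 9 \end{pmatrix}$$

with the solution

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \begin{bmatrix} 1 & 2 \\ 3 & 3 \end{bmatrix}^{-1}
\begin{pmatrix} 6 \\ 9 \end{pmatrix}
= \begin{pmatrix} 0 \\ 3 \end{pmatrix}.$$

Notice that the solution $x' = (0 \quad 3)$ satisfies the third equation also. ■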
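As a quick numerical check of the example above, the first two equations can
be solved as a square full-rank system and the result substituted into all
three equations. This sketch assumes NumPy and uses `np.linalg.solve` in place
of forming the inverse explicitly.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 3.0],
              [5.0, 7.0]])
y = np.array([6.0, 9.0, 21.0])

# Use the first two (linearly independent) rows as the full-rank system.
A1 = A[:2, :]
y1 = y[:2]
x = np.linalg.solve(A1, y1)

# The same x must also satisfy the third equation.
all_satisfied = np.allclose(A @ x, y)

print(x, all_satisfied)
```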



When r(A) in a consistent set of equations is less than the number of 
unknowns, there is an infinity of solutions. 



Example 2.6

Consider the equations Ax = y with

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 3 & 3 \end{bmatrix}
\quad \text{and} \quad
y = \begin{pmatrix} 6 \\ 12 \\ 9 \end{pmatrix}.$$

The rank of A is r(A) = 2 and elementary operations on the augmented
matrix [A y] give

$$[A \;\; y]^* = \begin{bmatrix} 1 & 2 & 3 & 6 \\ 0 & -3 & -6 & -9 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

Thus, r([A y]) = 2, which equals r(A), and the equations are consistent.
However, r(A) is less than the number of unknowns so that there is an
infinity of solutions. This infinity of solutions comes from the fact that one
element of x can be chosen arbitrarily and the remaining two chosen so
as to satisfy the set of equations. For example, if $x_1$ is chosen to be 1, the
solution is $x' = (1 \quad 1 \quad 1)$, whereas if $x_1$ is chosen to be 2, the solution is
$x' = (2 \quad -1 \quad 2)$. ■
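The idea of fixing one element of x and solving for the rest can be sketched
as follows, assuming NumPy. The function name `solution_given_x1` is
hypothetical, and the choice of rows 1 and 3 as the two independent equations
is one convenient option (row 2 is twice row 1 and adds no information).

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 3.0, 3.0]])
y = np.array([6.0, 12.0, 9.0])


def solution_given_x1(x1):
    """Fix x1, then solve the two independent equations for x2 and x3."""
    rows = [0, 2]
    # Move the x1 terms to the right-hand side.
    rhs = y[rows] - A[rows, 0] * x1
    x23 = np.linalg.solve(A[rows][:, 1:], rhs)
    return np.concatenate(([x1], x23))


sol_1 = solution_given_x1(1.0)
sol_2 = solution_given_x1(2.0)

print(sol_1, sol_2)
print(np.allclose(A @ sol_1, y), np.allclose(A @ sol_2, y))
```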

A more general method of finding a solution to a set of consistent equa- 
tions involves the use of generalized inverses. There are several defini- 
tions of generalized inverses [see Searle (1971), Searle and Hausman (1970), 
and Rao (1973)]. An adequate definition for our purposes is the following 
(Searle and Hausman, 1970). 

    A generalized inverse of A is any matrix $A^{-}$ that satisfies the
    condition $AA^{-}A = A$.

($A^{-}$ is used to denote a generalized inverse.) The generalized inverse is not
unique (unless A is square and of full rank, in which case $A^{-} = A^{-1}$). A
generalized inverse can be used to express a solution to a set of consistent
equations Ax = y as $x = A^{-}y$. This solution is unique only when r(A)
equals the number of unknowns in the set of equations. (The computer is
used to obtain generalized inverses when needed.)



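Example 2.7

For illustration, consider the set of consistent equations Ax = y where

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 3 \\ 5 & 7 \end{bmatrix}
\quad \text{and} \quad
y = \begin{pmatrix} 6 \\ 9 \\ 21 \end{pmatrix}.$$

It has been shown that r(A) = 2 which equals the number of unknowns so
that the solution is unique. A generalized inverse of A is

$$A^{-} = \frac{1}{18} \begin{bmatrix} -10 & 16 & -4 \\ 8 & -11 & 5 \end{bmatrix}$$

and the unique solution is given by

$$x = A^{-}y = \frac{1}{18} \begin{bmatrix} -10 & 16 & -4 \\ 8 & -11 & 5 \end{bmatrix}
\begin{pmatrix} 6 \\ 9 \\ 21 \end{pmatrix}
= \begin{pmatrix} 0 \\ 3 \end{pmatrix}.$$

It is left as an exercise to verify the matrix multiplication of $A^{-}y$ and that
$AA^{-}A = A$. ■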
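One way to obtain a generalized inverse by computer is NumPy's `pinv`, which
returns the Moore-Penrose inverse, a particular generalized inverse. This is
an assumption about tooling, not a statement of how the matrix in the example
was computed; for this A, however, the two matrices agree.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 3.0],
              [5.0, 7.0]])
y = np.array([6.0, 9.0, 21.0])

# Moore-Penrose inverse: one particular generalized inverse of A.
G = np.linalg.pinv(A)

x = G @ y                       # solution x = A^- y
defining_property = np.allclose(A @ G @ A, A)

print(np.round(18 * G, 6))      # compare with the integer matrix in the example
print(x, defining_property)
```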



Example 2.8

For another illustration, consider again the consistent equations Ax = y
from Example 2.6, where

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 3 & 3 \end{bmatrix}
\quad \text{and} \quad
y = \begin{pmatrix} 6 \\ 12 \\ 9 \end{pmatrix}.$$

This system has been shown to have an infinity of solutions. A generalized
inverse of A is

$$A^{-} = \begin{bmatrix}
-\frac{1}{10} & -\frac{2}{10} & \frac{4}{9} \\
0 & 0 & \frac{1}{9} \\
\frac{1}{10} & \frac{2}{10} & -\frac{2}{9}
\end{bmatrix},$$

which gives the solution

$$x = A^{-}y = (1 \quad 1 \quad 1)'.$$

This happens to agree with the first solution obtained in Example 2.6.
Again, it is left as an exercise to verify that $x = A^{-}y$ and $AA^{-}A = A$.
A different generalized inverse of A may lead to another solution of the
equations. ■
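The non-uniqueness can be seen numerically. The sketch below assumes NumPy
and uses the standard construction G + (I - GA)W, which satisfies the
defining condition for any conformable W; the particular W (an identity
matrix) and the cutoff `rcond=1e-10` are arbitrary choices made for
illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 3.0, 3.0]])
y = np.array([6.0, 12.0, 9.0])
I = np.eye(3)

# One generalized inverse: the Moore-Penrose inverse of the rank-2 matrix A.
G = np.linalg.pinv(A, rcond=1e-10)
x_a = G @ y

# A second generalized inverse built from the first.
# A (G + (I - GA) W) A = AGA + (A - AGA) W A = A for any W.
W = np.eye(3)                   # arbitrary, hypothetical choice
G2 = G + (I - G @ A) @ W
x_b = G2 @ y

print(x_a, x_b)
print(np.allclose(A @ G @ A, A), np.allclose(A @ G2 @ A, A))
print(np.allclose(A @ x_a, y), np.allclose(A @ x_b, y))
```

Both vectors solve the equations, but they are different members of the
infinite set of solutions.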



2.6 Orthogonal Transformations and Projections 

The linear transformation of vector x to vector y, both of order n, is
written as y = Ax, where A is the n × n matrix of coefficients effecting the
transformation. The transformation is a one-to-one transformation only if
A is nonsingular. Then, the inverse transformation of y to x is $x = A^{-1}y$.

A linear transformation is an orthogonal transformation if AA' = I. 
This condition implies that the row vectors of A are orthogonal and of unit 
length. Orthogonal transformations maintain distances and angles between 
vectors. That is, the spatial relationships among the vectors are not changed 
with orthogonal transformations. 



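Example 2.9

For illustration, let $y_1' = (3 \quad 10 \quad 20)$, $y_2' = (6 \quad 14 \quad 21)$, and

$$A = \begin{bmatrix} 1 & 1 & 1 \\ -1 & 0 & 1 \\ -1 & 2 & -1 \end{bmatrix}.$$

Then

$$x_1 = Ay_1 = \begin{bmatrix} 1 & 1 & 1 \\ -1 & 0 & 1 \\ -1 & 2 & -1 \end{bmatrix}
\begin{pmatrix} 3 \\ 10 \\ 20 \end{pmatrix}
= \begin{pmatrix} 33 \\ 17 \\ -3 \end{pmatrix}$$

and

$$x_2 = Ay_2 = \begin{bmatrix} 1 & 1 & 1 \\ -1 & 0 & 1 \\ -1 & 2 & -1 \end{bmatrix}
\begin{pmatrix} 6 \\ 14 \\ 21 \end{pmatrix}
= \begin{pmatrix} 41 \\ 15 \\ 1 \end{pmatrix}$$

are linear transformations of $y_1$ to $x_1$ and $y_2$ to $x_2$. These are not
orthogonal transformations because

$$AA' = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 6 \end{bmatrix} \neq I.$$

The rows of A are mutually orthogonal (the off-diagonal elements are zero)
but they do not have unit length. This can be made into an orthogonal
transformation by scaling each row vector of A to have unit length by
dividing each vector by its length. Thus,

$$x_1^* = A^*y_1 = \begin{bmatrix}
\frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \\
-\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\
-\frac{1}{\sqrt{6}} & \frac{2}{\sqrt{6}} & -\frac{1}{\sqrt{6}}
\end{bmatrix} y_1
= \begin{pmatrix} \frac{33}{\sqrt{3}} \\ \frac{17}{\sqrt{2}} \\ -\frac{3}{\sqrt{6}} \end{pmatrix}$$

and

$$x_2^* = A^*y_2
= \begin{pmatrix} \frac{41}{\sqrt{3}} \\ \frac{15}{\sqrt{2}} \\ \frac{1}{\sqrt{6}} \end{pmatrix}$$

are orthogonal transformations. It is left as an exercise to verify that the
orthogonal transformation has maintained the distance between the two
vectors; that is, verify that

$$(y_1 - y_2)'(y_1 - y_2) = (x_1^* - x_2^*)'(x_1^* - x_2^*) = 26.$$

[The squared distance between two vectors u and v is $(u - v)'(u - v)$.] ■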
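The exercise can be checked numerically. This sketch assumes NumPy; the
variable name `A_star` stands for the row-scaled matrix and is chosen here for
illustration.

```python
import numpy as np

A = np.array([[ 1.0, 1.0,  1.0],
              [-1.0, 0.0,  1.0],
              [-1.0, 2.0, -1.0]])
y1 = np.array([3.0, 10.0, 20.0])
y2 = np.array([6.0, 14.0, 21.0])

# Rows are mutually orthogonal but have squared lengths 3, 2, and 6.
AAt = A @ A.T

# Divide each row by its length to obtain an orthogonal transformation.
A_star = A / np.linalg.norm(A, axis=1, keepdims=True)

x1, x2 = A @ y1, A @ y2
x1_star, x2_star = A_star @ y1, A_star @ y2

dist_y = (y1 - y2) @ (y1 - y2)
dist_x = (x1 - x2) @ (x1 - x2)                       # not preserved by A
dist_x_star = (x1_star - x2_star) @ (x1_star - x2_star)

print(AAt)
print(dist_y, dist_x, dist_x_star)
```

The unscaled transformation changes the squared distance, whereas the scaled
version reproduces it.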



Projection of a vector onto a subspace is a special case of a transformation.
(Projection is a key step in least squares.) The objective of a projection
is to transform y in n-dimensional space to that vector $\hat{y}$ in a subspace
such that $\hat{y}$ is as close to y as possible. A linear transformation of y to $\hat{y}$,
$\hat{y} = Py$, is a projection if and only if P is idempotent and symmetric
(Rao, 1973), in which case P is referred to as a projection matrix.

An idempotent matrix is a square matrix that remains unchanged when 
multiplied by itself. That is, the matrix A is idempotent if AA = A. It can 
be verified that the rank of an idempotent matrix is equal to the sum of the 
elements on the diagonal (Searle, 1982; Searle and Hausman, 1970). This 
sum of elements on the diagonal of a square matrix is called the trace of
the matrix and is denoted by tr(A). Symmetry is not required for a matrix
to be idempotent. However, all idempotent matrices with which we are
concerned are symmetric.

The subspace of a projection is defined, or spanned, by the columns or
rows of the projection matrix P. If P is a projection matrix, (I - P) is also
a projection matrix. However, since P and (I - P) are orthogonal to each
other, that is, P(I - P) = 0, the projection by (I - P) is onto the subspace
orthogonal to that defined by P. The rank of a projection matrix is the
dimension of the subspace onto which it projects and, since the projection
matrix is idempotent, the rank is equal to its trace.



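Example 2.10

The matrix

$$A = \frac{1}{6} \begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}$$

is idempotent since

$$AA = A^2 = \frac{1}{36}
\begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}
\begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}
= \frac{1}{6} \begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}
= A.$$

The rank of A is given by

$$r(A) = \text{tr}(A) = \frac{1}{6}(5 + 2 + 5) = 2.$$

Since A is symmetric, it is also a projection matrix. Thus, the linear
transformation

$$\hat{y} = Ay_1 = \frac{1}{6}
\begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}
\begin{pmatrix} 3 \\ 10 \\ 20 \end{pmatrix}
= \begin{pmatrix} 2.5 \\ 11.0 \\ 19.5 \end{pmatrix}$$

is a projection of $y_1 = (3 \quad 10 \quad 20)'$ onto the subspace defined by the
columns of A. The vector $\hat{y}$ is the unique vector in this subspace that
is closest to $y_1$. That is, $(y_1 - \hat{y})'(y_1 - \hat{y})$ is a minimum. Since A is a
projection matrix, so is

$$I - A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
- \frac{1}{6} \begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}
= \frac{1}{6} \begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix}.$$

Then,

$$e = (I - A)y_1 = \frac{1}{2} \begin{pmatrix} 1 \\ -2 \\ 1 \end{pmatrix}$$

is a projection onto the subspace orthogonal to the subspace defined by A.
Note that $\hat{y}'e = 0$ and $\hat{y} + e = y_1$. ■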
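The properties used in this example (idempotency, symmetry, rank equal to
trace, and orthogonality of the two projections) can be confirmed with a
short sketch, assuming NumPy.

```python
import numpy as np

A = np.array([[ 5.0, 2.0, -1.0],
              [ 2.0, 2.0,  2.0],
              [-1.0, 2.0,  5.0]]) / 6.0
y1 = np.array([3.0, 10.0, 20.0])
I = np.eye(3)

is_idempotent = np.allclose(A @ A, A)
is_symmetric = np.allclose(A, A.T)
trace_A = np.trace(A)
rank_A = np.linalg.matrix_rank(A, tol=1e-10)

y_hat = A @ y1            # projection onto the column space of A
e = (I - A) @ y1          # projection onto the orthogonal subspace

print(is_idempotent, is_symmetric, trace_A, rank_A)
print(y_hat, e, y_hat @ e)
```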



2.7 Eigenvalues and Eigenvectors 

Eigenvalues and eigenvectors of matrices are needed for some of the meth- 
ods to be discussed, including principal component analysis, principal com- 
ponent regression, and assessment of the impact of collinearity (see Chap- 
ter 13). Determining the eigenvalues and eigenvectors of a matrix is a dif- 
ficult computational problem and computers are used for all but the very 
simplest cases. However, the reader needs to develop an understanding of 
the eigenanalysis of a matrix. 

The discussion of eigenanalysis is limited to real, symmetric, nonneg- 
ative definite matrices and, then, only key results are given. The reader 
is referred to other texts [such as Searle and Hausman (1970)] for more 
general discussions. In particular, Searle and Hausman (1970) show sev- 
eral important applications of eigenanalysis of asymmetric matrices. Real 
matrices do not contain any complex numbers as elements. Symmetric, 
nonnegative definite matrices are obtained from products of the type 
B' B and, if used as defining matrices in quadratic forms (see Chapter 4), 
yield only zero or positive scalars. 

It can be shown that for a real, symmetric matrix A (n × n) there
exists a set of n scalars $\lambda_i$, and n nonzero vectors $z_i$, $i = 1, \ldots, n$, such
that

$$\begin{aligned}
Az_i &= \lambda_i z_i, \\
\text{or} \quad Az_i - \lambda_i z_i &= 0, \\
\text{or} \quad (A - \lambda_i I)z_i &= 0, \quad i = 1, \ldots, n.
\end{aligned} \tag{2.13}$$

The $\lambda_i$ are the n eigenvalues (characteristic roots or latent roots) of the
matrix A and the $z_i$ are the corresponding (column) eigenvectors
(characteristic vectors or latent vectors).

There are nonzero solutions to equation 2.13 only if the matrix $(A - \lambda_i I)$
is less than full rank, that is, only if the determinant of $(A - \lambda_i I)$ is zero.
The $\lambda_i$ are obtained by solving the general determinantal equation

$$|A - \lambda I| = 0. \tag{2.14}$$

Since A is of order n × n, the determinant of $(A - \lambda I)$ is an nth degree
polynomial in $\lambda$. Solving this equation gives the n values of $\lambda$, which are not
necessarily distinct. Each value of $\lambda$ is then used in turn in Equation 2.13
to find the companion eigenvector $z_i$.

When the eigenvalues are distinct, the vector solution to Equation 2.13
is unique except for an arbitrary scale factor and sign. By convention, each
eigenvector is defined to be the solution vector scaled to have unit length;
that is, $z_i'z_i = 1$. Furthermore, the eigenvectors are mutually orthogonal;
$z_i'z_j = 0$ when $i \neq j$. When the eigenvalues are not distinct, there is an
additional degree of arbitrariness in defining the subsets of vectors
corresponding to each subset of nondistinct eigenvalues. Nevertheless, the
eigenvectors for each subset can be chosen so that they are mutually orthogonal
as well as orthogonal to the eigenvectors of all other eigenvalues. Thus, if
$Z = (z_1 \;\; z_2 \;\; \cdots \;\; z_n)$ is the matrix of eigenvectors, then $Z'Z = I$. This
implies that Z' is the inverse of Z so that $ZZ' = I$ as well.

Using Z and L, defined as the diagonal matrix of the $\lambda_i$, we can write
the initial equations $Az_i = \lambda_i z_i$ as

$$AZ = ZL, \tag{2.15}$$

$$\text{or} \quad Z'AZ = L, \tag{2.16}$$

$$\text{or} \quad A = ZLZ'. \tag{2.17}$$

Equation 2.16 shows that a real symmetric matrix A can be transformed to
a diagonal matrix by pre- and postmultiplying by Z' and Z, respectively.
Since L is a diagonal matrix, equation 2.17 shows that A can be expressed
as the sum of matrices:

$$A = ZLZ' = \sum \lambda_i (z_i z_i'), \tag{2.18}$$

where the summation is over the n eigenvalues and eigenvectors. Each term
is an n × n matrix of rank 1 so that the sum can be viewed as a decomposition
of the matrix A into n matrices that are mutually orthogonal. Some
of these may be zero matrices if the corresponding $\lambda_i$ are zero. The rank of
A is revealed by the number of nonzero eigenvalues $\lambda_i$.



Example 2.11

For illustration, consider the matrix

$$A = \begin{bmatrix} 10 & 3 \\ 3 & 8 \end{bmatrix}.$$

The eigenvalues of A are found by solving the determinantal equation
(equation 2.14),

$$|A - \lambda I| = \begin{vmatrix} 10 - \lambda & 3 \\ 3 & 8 - \lambda \end{vmatrix} = 0$$

or

$$(10 - \lambda)(8 - \lambda) - 9 = \lambda^2 - 18\lambda + 71 = 0.$$

The solutions to this quadratic (in $\lambda$) equation are

$$\lambda_1 = 12.16228 \quad \text{and} \quad \lambda_2 = 5.83772,$$

arbitrarily ordered from largest to smallest. Thus, the matrix of eigenvalues
of A is

$$L = \begin{bmatrix} 12.16228 & 0 \\ 0 & 5.83772 \end{bmatrix}.$$

The eigenvector corresponding to $\lambda_1 = 12.16228$ is obtained by solving
equation 2.13 for the elements of $z_1$:

$$(A - 12.16228\,I) \begin{pmatrix} z_{11} \\ z_{21} \end{pmatrix} = 0$$

or

$$\begin{bmatrix} -2.162276 & 3 \\ 3 & -4.162276 \end{bmatrix}
\begin{pmatrix} z_{11} \\ z_{21} \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Arbitrarily setting $z_{11} = 1$ and solving for $z_{21}$, using the first equation,
gives $z_{21} = .720759$. Thus, the vector $z_1' = (1 \quad .720759)$ satisfies the first
equation (and it can be verified that it also satisfies the second equation).
Rescaling this vector so it has unit length by dividing by

$$\text{length}(z_1) = \sqrt{z_1'z_1} = \sqrt{1.5194935} = 1.232677$$

gives the first eigenvector

$$z_1 = (.81124 \quad .58471)'.$$

The elements of $z_2$ are found in the same manner to be

$$z_2 = (-.58471 \quad .81124)'.$$

Thus, the matrix of eigenvectors for A is

$$Z = \begin{bmatrix} .81124 & -.58471 \\ .58471 & .81124 \end{bmatrix}.$$

Notice that the first column of Z is the first eigenvector, and the second
column is the second eigenvector. ■
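For comparison, the same eigenanalysis can be carried out with NumPy's `eigh`
routine for symmetric matrices. This is a sketch under the assumption that
NumPy is available; `eigh` returns eigenvalues in ascending order, so they are
reordered here from largest to smallest, and the sign of each eigenvector is
arbitrary and may differ from the text.

```python
import numpy as np

A = np.array([[10.0, 3.0],
              [ 3.0, 8.0]])

# eigh is intended for real symmetric (or Hermitian) matrices.
values, vectors = np.linalg.eigh(A)

# Reorder from largest to smallest eigenvalue.
order = np.argsort(values)[::-1]
lam = values[order]
Z = vectors[:, order]           # columns are unit-length eigenvectors
L = np.diag(lam)

print(lam)
print(Z)
print(np.allclose(Z.T @ Z, np.eye(2)), np.allclose(Z.T @ A @ Z, L))
```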

Example 2.12

Continuing with Example 2.11, notice that the matrix A is of rank two
because both eigenvalues are nonzero. The decomposition of A into two
orthogonal matrices each of rank one, $A = A_1 + A_2$, equation 2.18, is
given by

$$A_1 = \lambda_1 z_1 z_1' = 12.16228
\begin{pmatrix} .81124 \\ .58471 \end{pmatrix}
(.81124 \quad .58471)
= \begin{bmatrix} 8.0042 & 5.7691 \\ 5.7691 & 4.1581 \end{bmatrix}$$

and

$$A_2 = \lambda_2 z_2 z_2'
= \begin{bmatrix} 1.9958 & -2.7691 \\ -2.7691 & 3.8419 \end{bmatrix}.$$

Since the two columns of $A_1$ are multiples of the same vector $z_1$, they are
linearly dependent and, therefore, $r(A_1) = 1$. Similarly, $r(A_2) = 1$.
Multiplication of $A_1$ with $A_2$ shows that the two matrices are orthogonal to each
other: $A_1 A_2 = 0$, where 0 is a 2 × 2 matrix of zeros. Thus, the eigenanalysis
has decomposed the rank-2 matrix A into two rank-1 matrices. It is left as
an exercise to verify the multiplication and that $A_1 + A_2 = A$. ■
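The decomposition into rank-1 matrices can be verified numerically. The
sketch assumes NumPy and recomputes the eigenanalysis so that it stands on its
own; `np.outer` forms each product of an eigenvector with its own transpose.

```python
import numpy as np

A = np.array([[10.0, 3.0],
              [ 3.0, 8.0]])

values, vectors = np.linalg.eigh(A)
order = np.argsort(values)[::-1]        # largest eigenvalue first
lam = values[order]
Z = vectors[:, order]

# Rank-1 components lambda_i * z_i z_i'.
A1 = lam[0] * np.outer(Z[:, 0], Z[:, 0])
A2 = lam[1] * np.outer(Z[:, 1], Z[:, 1])

print(np.round(A1, 4))
print(np.round(A2, 4))
print(np.allclose(A1 + A2, A), np.allclose(A1 @ A2, 0.0))
print(np.trace(A1), np.trace(A2))
```

The signs of the eigenvectors do not matter here because each vector enters
the product twice.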



Notice that the sum of the eigenvalues in Example 2.11, $\lambda_1 + \lambda_2 = 18$, is
equal to tr(A). This is a general result: the sum of the eigenvalues for any
square symmetric matrix is equal to the trace of the matrix. Furthermore,
the trace of each of the component rank-1 matrices is equal to its eigenvalue:
$\text{tr}(A_1) = \lambda_1$ and $\text{tr}(A_2) = \lambda_2$.

Note that for A = B'B, we have

$$z_i'Az_i = \lambda_i z_i'z_i$$

and

$$\lambda_i = \frac{z_i'Az_i}{z_i'z_i} = \frac{z_i'B'Bz_i}{z_i'z_i} = \frac{c_i'c_i}{z_i'z_i},$$

where $c_i = Bz_i$. Therefore, if A = B'B for some real matrix B, then the
eigenvalues of A are nonnegative. Symmetric matrices with nonnegative
eigenvalues are called nonnegative definite matrices.
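The argument can be illustrated with a small numerical sketch, assuming
NumPy. The matrix B below is hypothetical; each eigenvalue of B'B is
recovered as a ratio of two sums of squares and so cannot be negative.

```python
import numpy as np

# Hypothetical real matrix B; A = B'B is symmetric and nonnegative definite.
B = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
A = B.T @ B

lam, Z = np.linalg.eigh(A)

# c_i = B z_i, so lambda_i = c_i'c_i / z_i'z_i.
C = B @ Z
ratios = np.array([(C[:, i] @ C[:, i]) / (Z[:, i] @ Z[:, i])
                   for i in range(len(lam))])

print(lam)
print(ratios)
```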



2.8 Singular Value Decomposition 

The eigenanalysis, Section 2.7, applies to a square symmetric matrix. In
this section, the eigenanalysis is used to develop a similar decomposition,
called the singular value decomposition, for a rectangular matrix. The
singular value decomposition is then used to give the principal component
analysis.

Let X be an n × p matrix with n > p. Then X'X is a square symmetric
matrix of order p × p. From Section 2.7, X'X can be expressed in terms
of its eigenvalues L and eigenvectors Z as

$$X'X = ZLZ'. \tag{2.19}$$

Here L is a diagonal matrix consisting of eigenvalues $\lambda_1, \ldots, \lambda_p$ of X'X.
From Section 2.7, we know that $\lambda_1, \ldots, \lambda_p$ are nonnegative. Similarly, XX'
is a square symmetric matrix but of order n × n. The rank of XX' will
be at most p so there will be at most p nonzero eigenvalues; they are in
fact the same p eigenvalues obtained from X'X. In addition, XX' will
have at least n - p eigenvalues that are zero. These n - p eigenvalues and
their vectors are dropped in the following. Denote with U the matrix of
eigenvectors of XX' that correspond to the p eigenvalues common to X'X.
Each eigenvector $u_i$ will be of order n × 1. Then,

$$XX' = ULU'. \tag{2.20}$$

Equations 2.19 and 2.20 jointly imply that the rectangular matrix X can
be written as

$$X = UL^{1/2}Z', \tag{2.21}$$

where $L^{1/2}$ is the diagonal matrix of the positive square roots of the p
eigenvalues of X'X. Thus, $L^{1/2}L^{1/2} = L$. Equation 2.21 is the singular
value decomposition of the rectangular matrix X. The elements of $L^{1/2}$,
$\lambda_i^{1/2}$, are called the singular values and the column vectors in U and Z
are the left and right singular vectors, respectively.

Since $L^{1/2}$ is a diagonal matrix, the singular value decomposition
expresses X as a sum of p rank-1 matrices,

$$X = \sum \lambda_i^{1/2} u_i z_i', \tag{2.22}$$

where summation is over i = 1, . . . , p. Furthermore, if the eigenvalues have
been ranked from largest to smallest, the first of these matrices is the
“best” rank-1 approximation to X, the sum of the first two matrices is
the “best” rank-2 approximation of X, and so forth. These are “best”
approximations in the least squares sense; that is, no other matrix (of the
same rank) will give a better agreement with the original matrix X as
measured by the sum of squared differences between the corresponding
elements of X and the approximating matrix (Householder and Young,
1938). The goodness of fit of the approximation in each case is given by
the ratio of the sum of the eigenvalues (squares of the singular values)
used in the approximation to the sum of all eigenvalues. Thus, the rank-1
approximation has a goodness of fit of $\lambda_1 / \sum \lambda_i$, the rank-2 approximation
has a goodness of fit of $(\lambda_1 + \lambda_2) / \sum \lambda_i$, and so forth.

Recall that there is an arbitrariness of sign for the eigenvectors obtained
from the eigenanalysis of X'X and XX'. Thus, care must be exercised in
choice of sign for the eigenvectors in reconstructing X or lower-order
approximations of X when the left and right eigenvectors have been obtained
from eigenanalyses. This is not a problem when U and Z have been
obtained directly from the singular value decomposition of X.



Example 2.13

Singular value decomposition is illustrated using data on average minimum
daily temperature $X_1$, average maximum daily temperature $X_2$, total
rainfall $X_3$, and total growing degree days $X_4$, for six locations. The data
were reported by Saeed and Francis (1984) to relate environmental
conditions to cultivar by environment interactions in sorghum and are used
with their kind permission. Each variable has been centered to have zero
mean, and standardized to have unit sum of squares. (The centering and
standardization are not necessary for a singular value decomposition. The
centering removes the mean effect of each variable so that the dispersion
about the mean is being analyzed. The standardization puts all variables
on an equal basis and is desirable in most cases, particularly when the
variables have different units of measure.) The X matrix is

$$X = (X_1 \;\; X_2 \;\; X_3 \;\; X_4) = \begin{bmatrix}
 .178146 & -.523245 &  .059117 & -.060996 \\
 .449895 & -.209298 &  .777976 &  .301186 \\
-.147952 &  .300866 & -.210455 & -.053411 \\
-.057369 &  .065406 &  .120598 & -.057203 \\
-.782003 & -.327028 & -.210455 & -.732264 \\
 .359312 &  .693299 & -.536780 &  .602687
\end{bmatrix}.$$

The singular value decomposition of X into $UL^{1/2}Z'$ gives

$$U = \begin{bmatrix}
-.113995 &  .308905 & -.810678 &  .260088 \\
 .251977 &  .707512 &  .339701 & -.319261 \\
 .007580 & -.303203 &  .277432 &  .568364 \\
-.028067 &  .027767 &  .326626 &  .357124 \\
-.735417 & -.234888 &  .065551 & -.481125 \\
 .617923 & -.506093 & -.198632 & -.385189
\end{bmatrix},$$

$$L^{1/2} = \begin{bmatrix}
1.496896 & 0 & 0 & 0 \\
0 & 1.244892 & 0 & 0 \\
0 & 0 & .454086 & 0 \\
0 & 0 & 0 & .057893
\end{bmatrix},$$

$$Z = \begin{bmatrix}
.595025 &  .336131 & -.383204 &  .621382 \\
.451776 & -.540753 &  .657957 &  .265663 \\
.004942 &  .768694 &  .639051 & -.026450 \\
.664695 &  .060922 & -.108909 & -.736619
\end{bmatrix}.$$

The columns of U and Z are the left and right singular vectors, respectively.
The first column of U, $u_1$, the first column of Z, $z_1$, and the first singular
value, $\lambda_1^{1/2} = 1.496896$, give the best rank-1 approximation of X,

$$A_1 = \lambda_1^{1/2} u_1 z_1' = (1.4969)
\begin{pmatrix} -.1140 \\ .2520 \\ .0076 \\ -.0281 \\ -.7354 \\ .6179 \end{pmatrix}
(.5950 \quad .4518 \quad .0049 \quad .6647)$$

$$= \begin{bmatrix}
-.101535 & -.077091 & -.000843 & -.113423 \\
 .224434 &  .170403 &  .001864 &  .250712 \\
 .006752 &  .005126 &  .000056 &  .007542 \\
-.024999 & -.018981 & -.000208 & -.027927 \\
-.655029 & -.497335 & -.005440 & -.731725 \\
 .550378 &  .417877 &  .004571 &  .614820
\end{bmatrix}.$$

The goodness of fit of $A_1$ to X is measured by

$$\frac{\lambda_1}{\sum \lambda_i} = \frac{(1.4969)^2}{4} = .56$$

or the sum of squares of the differences between the elements of X and
$A_1$, the lack of fit, is 44% of the total sum of squares of the elements in X.
This is not a very good approximation.

The rank-2 approximation to X is obtained by adding to $A_1$ the matrix
$A_2 = \lambda_2^{1/2} u_2 z_2'$. This gives

$$A_1 + A_2 = \begin{bmatrix}
 .027725 & -.285040 &  .295197 & -.089995 \\
 .520490 & -.305880 &  .678911 &  .304370 \\
-.120122 &  .209236 & -.290091 & -.015453 \\
-.013380 & -.037673 &  .026363 & -.025821 \\
-.753317 & -.339213 & -.230214 & -.749539 \\
 .338605 &  .758568 & -.479730 &  .576438
\end{bmatrix},$$

which has goodness of fit

$$\frac{\lambda_1 + \lambda_2}{\sum \lambda_i} = \frac{(1.4969)^2 + (1.2449)^2}{4} = .95.$$

In terms of approximating X with the rank-2 matrix $A_1 + A_2$, the goodness
of fit of .95 means that the sum of squares of the discrepancies between
X and $(A_1 + A_2)$ is 5% of the total sum of squares of all elements in X.
The sum of squares of all elements in X is $\sum \lambda_i$, the sum of squares of all
elements in $(A_1 + A_2)$ is $(\lambda_1 + \lambda_2)$, and the sum of squares of all elements in
$[X - (A_1 + A_2)]$ is $(\lambda_3 + \lambda_4)$. In terms of the geometry of the data vectors,
the goodness of fit of .95 means that 95% of the dispersion of the “cloud”
of points in the original four-dimensional space is, in reality, contained in
two dimensions, or the points in four-dimensional space very nearly fall on
a plane. Only 5% of the dispersion is lost if the third and fourth dimensions
are ignored.

Using all four singular values and their singular vectors gives the com- 
plete decomposition of X into four orthogonal rank-1 matrices. The sum of 
the four matrices equals X, within the limits of rounding error. The anal- 
ysis has shown, by the relatively small size of the third and fourth singular 
values, that the last two dimensions contain little of the dispersion and can 
safely be ignored in interpretation of the data. ■ 
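
The goodness-of-fit values in this example can be reproduced from the singular values alone. The Python/NumPy sketch below is an added illustration, not part of the original analysis. The names `singular_values`, `rank_k_approximation`, and `X_demo` are arbitrary, and `X_demo` is a randomly generated stand-in because the data matrix itself is not repeated here. The second half checks, on that stand-in, that the lack of fit of a rank-k approximation equals the sum of the discarded eigenvalues.

```python
import numpy as np

# Singular values of X as reported in the example above.
singular_values = np.array([1.4969, 1.2449, 0.4541, 0.0579])

# The eigenvalues of X'X are the squared singular values.
eigenvalues = singular_values ** 2

# Goodness of fit of the rank-k approximation A_1 + ... + A_k.
goodness_of_fit = np.cumsum(eigenvalues) / eigenvalues.sum()
for k, fit in enumerate(goodness_of_fit, start=1):
    print(f"rank-{k} approximation: goodness of fit = {fit:.2f}")


def rank_k_approximation(X, k):
    """Sum of the first k rank-1 matrices of the singular value decomposition."""
    U, s, Zt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Zt[:k, :]


# Stand-in data (an assumption): any 6 x 4 matrix illustrates the identity.
X_demo = np.random.default_rng(0).normal(size=(6, 4))
s_demo = np.linalg.svd(X_demo, compute_uv=False)

# Sum of squared discrepancies between X_demo and its rank-2 approximation.
lack_of_fit = ((X_demo - rank_k_approximation(X_demo, 2)) ** 2).sum()
print(np.isclose(lack_of_fit, (s_demo[2:] ** 2).sum()))
```

Working from the singular values directly avoids retyping the data, at the cost of inheriting the rounding in the four reported values.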



Principal Component Analysis

The singular value decomposition is the first step in principal component
analysis. Using the result $X = U L^{1/2} Z'$ and the property that
$Z'Z = I$, one can define the $n \times p$ matrix $W$ as

$$W = XZ = U L^{1/2}. \qquad (2.23)$$

The first column of $Z$ is the first of the right singular vectors of $X$, or
the first eigenvector of $X'X$. Thus, the coefficients in the first eigenvector
define the particular linear function of the columns of $X$ (of the original
variables) that generates the first column of $W$. The second column of $W$
is obtained using the second eigenvector of $X'X$, and so on. Notice that
$W'W = L$. Thus, $W$ is an $n \times p$ matrix that, unlike $X$, has the property
that all its columns are orthogonal. ($L$ is a diagonal matrix so that all
off-diagonal elements, the sums of products between columns of $W$, are
zero.) The sum of squares of the $i$th column of $W$ is $\lambda_i$, the $i$th diagonal
element of $L$. Thus, if $X$ is an $n \times p$ matrix of observations on $p$ variables,
each column of $W$ is a new variable defined as a linear transformation of
the original variables. The $i$th new variable has sum of squares $\lambda_i$ and all
are pairwise orthogonal. This analysis is called the principal component
analysis of $X$, and the columns of $W$ are the principal components
(sometimes called principal component scores).

Principal component analysis is used where the columns of $X$ correspond
to the observations on different variables. The transformation is to a set
of orthogonal variables such that the first principal component accounts
for the largest possible amount of the total dispersion, measured by $\lambda_1$, the
second principal component accounts for the largest possible amount of the
remaining dispersion $\lambda_2$, and so forth. The total dispersion is given by the
sum of all eigenvalues, which is equal to the sum of squares of the original
variables; $\operatorname{tr}(X'X) = \operatorname{tr}(W'W) = \sum_i \lambda_i$.
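
The relationships in equation (2.23) and the trace identity can be checked numerically. The following Python/NumPy sketch is an added illustration under stated assumptions: the data matrix `X` is randomly generated (hypothetical, not the Saeed and Francis data), and the variable names are arbitrary. It computes the principal components both ways and confirms that $W'W$ is diagonal with the eigenvalues of $X'X$ on the diagonal.

```python
import numpy as np

# Hypothetical n = 6 by p = 4 data matrix (an assumption for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))

# Singular value decomposition X = U L^{1/2} Z'; s holds the singular values.
U, s, Zt = np.linalg.svd(X, full_matrices=False)
Z = Zt.T

# Principal components computed both ways, as in equation (2.23).
W_from_XZ = X @ Z
W_from_U = U * s  # multiplies column i of U by the ith singular value

# W'W should equal L, the diagonal matrix of eigenvalues of X'X.
WtW = W_from_XZ.T @ W_from_XZ

print(np.allclose(W_from_XZ, W_from_U))
print(np.allclose(WtW, np.diag(s ** 2)))
print(np.isclose(np.trace(X.T @ X), np.trace(WtW)))
```

Both routes give the same `W` here because `U`, `s`, and `Z` come from a single call to `np.linalg.svd`; mixing singular vectors from separate decompositions could differ by the sign of a column.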

Example 2.14

For the Saeed and Francis data, Example 2.13, each column of $Z$ contains
the coefficients that define one of the principal components as a linear
function of the original variables. The first vector in $Z$,

$$z_1 = ( .5950 \quad .4518 \quad .0049 \quad .6647 )',$$

has similar first, second, and fourth coefficients with the third coefficient
being near zero. Thus, the first principal component is essentially an
average of the three temperature variables $X_1$, $X_2$, and $X_4$. The second column
vector in $Z$,

$$z_2 = ( .3361 \quad -.5408 \quad .7687 \quad .0609 )',$$

gives heavy positive weight to $X_3$, heavy negative weight to $X_2$, and
moderate positive weight to $X_1$. Thus, the second principal component will be
large for those observations that have high rainfall $X_3$, and small difference
between the maximum and minimum daily temperatures $X_2$ and $X_1$.

The third and fourth principal components account for only 5% of the
total dispersion. This small amount of dispersion may be due more to random
“noise” than to real patterns in the data. Consequently, the interpretation
of these components may not be very meaningful. The third principal
component will be large when there is high rainfall and large difference between
the maximum and minimum daily temperatures,

$$z_3 = ( -.3832 \quad .6580 \quad .6391 \quad -.1089 )'.$$

The variable degree days $X_4$ has little involvement in the second and third
principal components; the fourth coefficient is relatively small. The fourth
principal component is determined primarily by the difference between an
average minimum daily temperature and degree days,

$$z_4 = ( .6214 \quad .2657 \quad -.0265 \quad -.7366 )'.$$ ■

The principal component vectors are obtained either by the multiplication
$W = U L^{1/2}$ or $W = XZ$. The first is easier since it is the simple
scalar multiplication of each column of $U$ with the appropriate $\lambda_i^{1/2}$.

Example 2.15

The principal component vectors for the Saeed and Francis data of
Example 2.13 are (with some rounding)



$$
W =
\begin{bmatrix}
 -.1706 &  .3846 & -.3681 &  .0151 \\
  .3772 &  .8808 &  .1543 & -.0185 \\
  .0113 & -.3775 &  .1260 &  .0329 \\
 -.0420 &  .0346 &  .1483 &  .0207 \\
-1.1008 & -.2924 &  .0298 & -.0279 \\
  .9250 & -.6300 & -.0902 & -.0223
\end{bmatrix}.
$$



The sum of squares of the first principal component, the first column of $W$,
is $\lambda_1 = (1.4969)^2 = 2.2407$. Similarly, the sums of squares for the second,
third, and fourth principal components are

$$\lambda_2 = (1.2449)^2 = 1.5498$$

$$\lambda_3 = (.4541)^2 = .2062$$

$$\lambda_4 = (.0579)^2 = .0034.$$

These sum to 4.0, the total sum of squares of the original four variables
after they were standardized. The proportion of the total sum of squares
accounted for by the first principal component is $\lambda_1 / \sum_i \lambda_i = 2.2407/4 = .56$
or 56%. The first two principal components account for $(\lambda_1 + \lambda_2)/4 =
3.79/4 = .95$ or 95% of the total sum of squares of the four original variables.
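
These sums of squares can be recomputed from the principal component vectors themselves. The Python/NumPy sketch below is an added illustration: it assumes the 24 printed values of Example 2.15 form a 6 x 4 array read row by row (six locations, four components), and the variable names are arbitrary. Because the printed values are rounded, agreement is only to about three decimal places.

```python
import numpy as np

# Principal component vectors from Example 2.15, one row per location
# (row-by-row reading of the printed values is an assumption).
W = np.array([
    [-0.1706,  0.3846, -0.3681,  0.0151],
    [ 0.3772,  0.8808,  0.1543, -0.0185],
    [ 0.0113, -0.3775,  0.1260,  0.0329],
    [-0.0420,  0.0346,  0.1483,  0.0207],
    [-1.1008, -0.2924,  0.0298, -0.0279],
    [ 0.9250, -0.6300, -0.0902, -0.0223],
])

# Column sums of squares: the eigenvalues lambda_1, ..., lambda_4.
sums_of_squares = (W ** 2).sum(axis=0)

# Proportion of the total sum of squares, individually and cumulatively.
proportion = sums_of_squares / sums_of_squares.sum()
cumulative = np.cumsum(proportion)

# Off-diagonal elements of W'W should be near zero (orthogonal columns).
cross_products = W.T @ W

print(np.round(sums_of_squares, 4))
print(np.round(cumulative, 2))
```

The off-diagonal elements of `cross_products` are a useful side check: they are the sums of products between principal components and should vanish apart from rounding.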

Each of the original data vectors in X was a vector in six-dimensional 
space and, together, the four vectors defined a four-dimensional subspace. 
These vectors were not orthogonal. The four vectors in W , the principal 
component vectors, are linear functions of the original vectors and, as such, 
they fall in the same four-dimensional subspace. The principal component 
vectors, however, are orthogonal and defined such that the first principal 
component vector has the largest possible sum of squares. This means that 
the direction of the first principal component axis coincides with the major 
axis of the ellipsoid of observations, Figure 2.3. Note that the “cloud” of
observations, the data points, does not change; only the axes are being 
redefined. The second principal component has the largest possible sum 
of squares of all vectors orthogonal to the first, and so on. The fact that 
the first two principal components account for 95% of the sum of squares 
in this example shows that very little of the dispersion among the data 
points occurs in the third and fourth principal component dimensions. In 
other words, the variability among these six locations in average minimum 
and average maximum temperature, total rainfall, and total growing degree 
days, can be adequately described by considering only the two dimensions 
(or variables) defined by the first two principal components. 

FIGURE 2.3. The first two principal components of the Saeed and Francis (1984)
data on average minimum temperature, average maximum temperature, total
rainfall, and growing degree days for six locations. The first principal component
primarily reflects average temperature. The second principal component is a measure
of rainfall minus the spread between minimum and maximum temperature.
[Plot not reproduced; axis label: First principal component.]

The plot of the first two principal components from the Saeed and Francis
data, Figure 2.3, shows that locations 5 and 6 differ from each other
primarily in the first principal component. This component was noted earlier
to be mainly a temperature difference; location 6 is the warmer and has
the longer growing season. The other four locations differ primarily in the
second principal component which reflects amount of rainfall and the
difference in maximum and minimum temperature. Location 2 has the highest
rainfall and tends to have a large difference in maximum and minimum daily
temperature. Location 6 is also the lowest in the second principal component
indicating a lower rainfall and small difference between the maximum
and minimum temperature. Thus, location 6 appears to be a relatively hot,
dry environment with somewhat limited diurnal temperature variation. ■



2.9 Summary 

This chapter has presented the key matrix operations that are used in 
this text. The student must be able to use matrix notation and matrix 
operations. Of particular importance are 

• the concepts of rank and the transpose of a matrix; 

• the special types of matrices: square, symmetric, diagonal, identity, 
and idempotent; 

• the elementary matrix operations of addition and multiplication; and 

• the use of the inverse of a square symmetric matrix to solve a set of 
equations. 

The geometry of vectors and projections is useful in understanding least 
squares principles. Eigenanalysis and singular value decomposition are used 
later in the text. 



2.10 Exercises 

2.1. Let

$$
A = \begin{bmatrix} 1 & 0 \\ 2 & 4 \\ -1 & 2 \end{bmatrix}, \quad
B = \begin{bmatrix} 1 & 2 & -1 \\ 0 & 3 & -4 \end{bmatrix},
$$

$c' = ( 1 \quad 2 \quad 0 )$, and $d = 2$, a scalar.

Perform the following operations, if possible. If the operation is not
possible, explain why.

(a) $c'A$

(b) $A'c$

(c) $B' + A$

(d) $c'B$

(e) $A - d$

(f) $(dB' + A)$.

2.2. Find the rank of each of the following matrices. Which matrices are 
of full rank? 



1 


0 


0 


0- 




-1 


1 


0 


0- 


0 


1 


0 


0 


B = 


1 


0 


1 


0 


0 


0 


1 


0 




1 


0 


0 


1 


0 


0 


0 


1 . 




.1 


0 


0 


0 . 



110 0 
10 10 

10 0 1 

1 -1 -1 -1 



2.3. Use $B$ in Exercise 2.2 to compute $D = B(B'B)^{-1}B'$. Determine
whether $D$ is idempotent. What is the rank of $D$?

2.4. Find $a_{ij}$ elements to make the following matrix symmetric. Can you
choose $a_{33}$ to make the matrix idempotent?

$$
A = \begin{bmatrix}
1 & 2 & a_{13} & 4 \\
2 & -1 & 0 & a_{24} \\
6 & 0 & a_{33} & -2 \\
a_{41} & 8 & -2 & 3
\end{bmatrix}
$$



2.5. Verify that A and B are inverses of each other. 

fin Kl r 2 



2.6. Find $b_{41}$ such that $a$ and $b$ are orthogonal.



2.7. Plot the following vectors on a two-dimensional coordinate system. 



By inspection of the plot, which pairs of vectors appear to be
orthogonal? Verify numerically that they are orthogonal and that all other
pairs in this set are not orthogonal. Explain from the geometry of
the plot how you know there is a linear dependency among the three
vectors.

2.8. The three vectors in Exercise 2.7 are linearly dependent. Find the
linear function of $v_1$ and $v_2$ that equals $v_3$. Set the problem up as a
system of linear equations to be solved. Let $V = ( v_1 \quad v_2 )$, and let
$x' = ( x_1 \quad x_2 )$ be the vector of unknown coefficients. Then, $Vx = v_3$
is the system of equations to be solved for $x$.

(a) Show that the system of equations is consistent. 

(b) Show that there is a unique solution. 

(c) Find the solution. 

2.9. Expand the set of vectors in Exercise 2.7 to include a fourth vector,
$v_4' = ( 8 \quad 5 )$. Reformulate Exercise 2.8 to include the fourth vector
by including $v_4$ in $V$ and an additional coefficient in $x$. Is this system
of equations consistent? Is the solution unique? Find a solution. If
solutions are not unique, find another solution.



2.10. Use the determinant to determine which of the following matrices has 
a unique inverse. 



A = 



1 1 
4 10 





3 

2 



2.11. Given the following matrix, 



$$A = \begin{bmatrix} 3 & \sqrt{2} \\ \sqrt{2} & 2 \end{bmatrix},$$



(a) find the eigenvalues and eigenvectors of A. 

(b) What do your findings tell you about the rank of A? 



2.12. Given the following eigenvalues with their corresponding eigenvectors, 
and knowing that the original matrix was square and symmetric, 
reconstruct the original matrix. 




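2.13. Find the inverse of the following matrix,

$$A = \begin{bmatrix} 5 & 0 & 0 \\ 0 & 10 & 2 \\ 0 & 2 & 3 \end{bmatrix}.$$

2.14. Let

$$
X = \begin{bmatrix}
1 & .2 & 0 \\
1 & .4 & 0 \\
1 & .6 & 0 \\
1 & .8 & 0 \\
1 & .2 & .1 \\
1 & .4 & .1 \\
1 & .6 & .1 \\
1 & .8 & .1
\end{bmatrix}, \quad
Y = \begin{bmatrix} 242 \\ 240 \\ 236 \\ 230 \\ 239 \\ 238 \\ 231 \\ 226 \end{bmatrix}.
$$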



(a) Compute $X'X$ and $X'Y$. Verify by separate calculations that
the $(i, j) = (2, 2)$ element in $X'X$ is the sum of squares of
column 2 in $X$. Verify that the $(2, 3)$ element is the sum of
products between columns 2 and 3 of $X$. Identify the elements
in $X'Y$ in terms of sums of squares or products of the columns
of $X$ and $Y$.

(b) Is $X$ of full column rank? What is the rank of $X'X$?

(c) Obtain $(X'X)^{-1}$. What is the rank of $(X'X)^{-1}$? Verify by
matrix multiplication that $(X'X)^{-1}X'X = I$.

(d) Compute $P = X(X'X)^{-1}X'$ and verify by matrix multiplication
that $P$ is idempotent. Compute the trace $\operatorname{tr}(P)$. What is
$r(P)$?

2.15. Use X as defined in Exercise 2.14. 

(a) Find the singular value decomposition of X . Explain what the 
singular values tell you about the rank of X . 

(b) Compute the rank-1 approximation of $X$; call it $A_1$. Use the
singular values to state the “goodness of fit” of this rank-1
approximation.

(c) Use $A_1$ to compute a rank-1 approximation of $X'X$; that is,
compute $A_1'A_1$. Compare $\operatorname{tr}(A_1'A_1)$ with $\lambda_1$ and $\operatorname{tr}(X'X)$.

2.16. Use $X'X$ as computed in Exercise 2.14.

(a) Compute the eigenanalysis of $X'X$. What is the relationship
between the singular values of $X$ obtained in Exercise 2.15 and
the eigenvalues obtained for $X'X$?

(b) Use the results of the eigenanalysis to compute the rank-1
approximation of $X'X$. Compare this result to the approximation
of $X'X$ obtained in Exercise 2.15.

(c) Show algebraically that they should be identical.

2.17. Verify that

$$A = \frac{1}{15} \begin{bmatrix} 3 & -13 & 8 \\ 12 & -7 & 2 \\ -12 & 17 & -7 \end{bmatrix}$$

is the inverse of

$$B = \begin{bmatrix} 1 & 3 & 2 \\ 4 & 5 & 6 \\ 8 & 7 & 9 \end{bmatrix}.$$


2.18. Show that the equations Ax = y are consistent where 



$$A = \begin{bmatrix} 1 & 2 \\ 3 & 3 \\ 5 & 7 \end{bmatrix}$$

and y 




2.19. Verify that 





16 -4 
-11 5 



is a generalized inverse of 



$$A = \begin{bmatrix} 1 & 2 \\ 3 & 3 \\ 5 & 7 \end{bmatrix}.$$



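2.20. Verify that

$$
A^- = \begin{bmatrix}
-\frac{1}{10} & -\frac{2}{10} & \frac{4}{9} \\
0 & 0 & \frac{1}{9} \\
\frac{1}{10} & \frac{2}{10} & -\frac{2}{9}
\end{bmatrix}
$$

is a generalized inverse of

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 3 & 3 & 3 \end{bmatrix}.$$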


2.21. Use the generalized inverse in Exercise 2.20 to obtain a solution to
the equations $Ax = y$, where $A$ is defined in Exercise 2.20 and $y =
( 6 \quad 12 \quad 9 )'$. Verify that the solution you obtained satisfies $Ax = y$.

2.22. The eigenanalysis of

$$A = \begin{bmatrix} 10 & 3 \\ 3 & 8 \end{bmatrix}$$

in Section 2.7 gave

$$
A_1 = \begin{bmatrix} 8.0042 & 5.7691 \\ 5.7691 & 4.1581 \end{bmatrix}
\quad \text{and} \quad
A_2 = \begin{bmatrix} 1.9958 & -2.7691 \\ -2.7691 & 3.8419 \end{bmatrix}.
$$

Verify the multiplication of the eigenvectors to obtain $A_1$ and $A_2$.
Verify that $A_1 + A_2 = A$, and that $A_1$ and $A_2$ are orthogonal to
each other.



2.23. In Section 2.6, a linear transformation of $y_1 = ( 3 \quad 10 \quad 20 )'$ to $x_1 =
( 33 \quad 17 \quad -3 )'$ and of $y_2 = ( 6 \quad 14 \quad 21 )'$ to $x_2 = ( 41 \quad 15 \quad 1 )'$ was
made using the matrix

$$A = \begin{bmatrix} 1 & 1 & 1 \\ -1 & 0 & 1 \\ -1 & 2 & -1 \end{bmatrix}.$$

The vectors of $A$ were then standardized so that $A'A = I$ to produce
the orthogonal transformation of $y_1$ and $y_2$ to

$$x_1^* = ( 33/\sqrt{3} \quad 17/\sqrt{2} \quad -3/\sqrt{6} )'$$

and

$$x_2^* = ( 41/\sqrt{3} \quad 15/\sqrt{2} \quad 1/\sqrt{6} )',$$

respectively. Show that the squared distance between $y_1$ and $y_2$ is
unchanged when the orthogonal transformation is made but not when
the nonorthogonal transformation is made. That is, show that

$$(y_1 - y_2)'(y_1 - y_2) = (x_1^* - x_2^*)'(x_1^* - x_2^*)$$

but that

$$(y_1 - y_2)'(y_1 - y_2) \neq (x_1 - x_2)'(x_1 - x_2).$$



2.24. (a) Let $A$ be an $m \times n$ matrix and $B$ be an $n \times m$ matrix. Then show
that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$.

(b) Use (a) to show that $\operatorname{tr}(ABC) = \operatorname{tr}(BCA)$, where $C$ is an $m \times m$
matrix.

2.25. Let $a^*$ be an $m \times 1$ vector with $a^{*\prime} a^* > 0$. Define $a = a^* / (a^{*\prime} a^*)^{1/2}$
and $A = aa'$. Show that $A$ is a symmetric idempotent matrix of rank 1.

2.26. Let $a$ and $b$ be two $m \times 1$ vectors that are orthogonal to each other.
Define $A = aa'$ and $B = bb'$. Show that $AB = BA = 0$, a matrix
of zeros.

2.27. Gram-Schmidt orthogonalization. An orthogonal basis for a space
spanned by some vectors can be obtained using the Gram-Schmidt
orthogonalization procedure.

(a) Consider two linearly independent vectors $v_1$ and $v_2$. Define
$z_1 = v_1$ and $z_2 = v_2 - v_1 c_{2.1}$, where $c_{2.1} = (v_1'v_2)/(v_1'v_1)$.
Show that $z_1$ and $z_2$ are orthogonal. Also, show that $z_1$ and $z_2$
span the same space as $v_1$ and $v_2$.

(b) Consider three linearly independent vectors $v_1$, $v_2$, and $v_3$.
Define $z_1$ and $z_2$ as in (a) and $z_3 = v_3 - c_{3.1} z_1 - c_{3.2} z_2$, where
$c_{3.i} = (z_i'v_3)/(z_i'z_i)$, $i = 1, 2$. Show that $z_1$, $z_2$, and $z_3$ are
mutually orthogonal and span the same space as $v_1$, $v_2$, and $v_3$.




3 

MULTIPLE REGRESSION IN 
MATRIX NOTATION 



We have reviewed linear regression in algebraic nota- 
tion and have introduced the matrix notation and op- 
erations needed to continue with the more complicated 
models. 

This chapter presents the model, and develops the nor- 
mal equations and solution to the normal equations for 
a general linear model involving any number of inde- 
pendent variables. The matrix formulation for the vari- 
ances of linear functions is used to derive the measures 
of precision of the estimates. 

Chapter 1 provided an introduction to multiple regression and suggested 
that a more convenient notation was needed. Chapter 2 familiarized you 
with matrix notation and operations with matrices. This chapter states 
multiple regression results in matrix notation. Developments in the chapter 
are for full rank models. Less than full rank models that use generalized 
inverses are discussed in Chapter 9. 



3.1 The Model 

The linear additive model for relating a dependent variable to $p$
independent variables is

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \epsilon_i. \qquad (3.1)$$

The subscript $i$ denotes the observational unit from which the observations
on $Y$ and the $p$ independent variables were taken. The second subscript
designates the independent variable. The sample size is denoted with $n$, $i =
1, \ldots, n$, and $p$ denotes the number of independent variables. There are
$(p + 1)$ parameters $\beta_j$, $j = 0, \ldots, p$ to be estimated when the linear model
includes the intercept $\beta_0$. For convenience, we use $p' = (p + 1)$. In this book
we assume that $n > p'$. Four matrices are needed to express the linear
model in matrix notation:

$Y$: the $n \times 1$ column vector of observations on the dependent variable $Y_i$;

$X$: the $n \times p'$ matrix consisting of a column of ones, which is labeled $\mathbf{1}$,
followed by the $p$ column vectors of the observations on the
independent variables;

$\beta$: the $p' \times 1$ vector of parameters to be estimated; and

$\epsilon$: the $n \times 1$ vector of random errors.

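With these definitions, the linear model can be written as

$$Y = X\beta + \epsilon, \qquad (3.2)$$

or

$$
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix}
1 & X_{11} & X_{12} & X_{13} & \cdots & X_{1p} \\
1 & X_{21} & X_{22} & X_{23} & \cdots & X_{2p} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
1 & X_{n1} & X_{n2} & X_{n3} & \cdots & X_{np}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}
+
\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix},
$$

where the four arrays have dimensions $(n \times 1)$, $(n \times p')$, $(p' \times 1)$, and $(n \times 1)$, respectively.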
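
The matrix form (3.2) can be made concrete with a small numerical sketch. The Python/NumPy code below is an added illustration, not taken from the text: the sample size, the parameter values in `beta`, and the simulated independent variables and errors are all hypothetical. It builds $X$ with a leading column of ones, forms $Y = X\beta + \epsilon$, and checks that the first element of $Y$ agrees with the scalar form (3.1) for $i = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 8, 2          # hypothetical sample size and number of independent variables
p_prime = p + 1      # number of parameters, including the intercept

# n x p' matrix X: a column of ones followed by the p independent variables.
x_vars = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), x_vars])

beta = np.array([10.0, 2.0, -1.0])         # hypothetical p' x 1 parameter vector
epsilon = rng.normal(scale=0.5, size=n)    # hypothetical n x 1 random errors

# The linear model in matrix notation, equation (3.2).
Y = X @ beta + epsilon

# Scalar form (3.1) written out for the first observation.
y1_by_hand = beta[0] + beta[1] * X[0, 1] + beta[2] * X[0, 2] + epsilon[0]

# X has full column rank when its rank equals p'.
is_full_rank = np.linalg.matrix_rank(X) == p_prime

print(X.shape, Y.shape, bool(is_full_rank))
print(np.isclose(Y[0], y1_by_hand))
```

In an actual analysis `beta` and `epsilon` are unknown; they are fixed here only so that the two forms of the model can be compared element by element.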

The $X$ Matrix

Each column of $X$ contains the values for a particular independent variable.
The elements of a particular row of $X$, say row $r$, are the coefficients on
the corresponding parameters in $\beta$ that give $\mathcal{E}(Y_r)$. Notice that $\beta_0$ has the
constant multiplier 1 for all observations; hence, the column vector $\mathbf{1}$ is the
first column of $X$. Multiplying the first row of $X$ by $\beta$, and adding the
first element of $\epsilon$ confirms that the model for the first observation is

$$Y_1 = \beta_0 + \beta_1 X_{11} + \beta_2 X_{12} + \cdots + \beta_p X_{1p} + \epsilon_1.$$

The vectors $Y$ and $\epsilon$ are random vectors; the elements of these vectors are
random variables. The matrix $X$ is considered to be a matrix of known
constants. A model for which $X$ is of full column rank is called a full-rank
model.

The $\beta$ Vector

The vector $\beta$ is a vector of unknown constants to be estimated from the
data. Each element $\beta_j$ is a partial regression coefficient reflecting the change
in the dependent variable per unit change in the $j$th independent variable,




